how to make research valid and reliable

Validity & Reliability In Research

A Plain-Language Explanation (With Examples)

By: Derek Jansen (MBA) | Expert Reviewer: Kerryn Warren (PhD) | September 2023

Validity and reliability are two related but distinctly different concepts within research. Understanding what they are and how to achieve them is critically important to any research project. In this post, we’ll unpack these two concepts as simply as possible.

This post is based on our popular online course, Research Methodology Bootcamp. In the course, we unpack the basics of methodology using straightforward language and loads of examples.

Overview: Validity & Reliability

The big picture
Validity 101
Reliability 101
Key takeaways

First, The Basics…

First, let’s start with a big-picture view and then we can zoom in to the finer details.

Validity and reliability are two incredibly important concepts in research, especially within the social sciences. Both validity and reliability have to do with the measurement of variables and/or constructs – for example, job satisfaction, intelligence, productivity, etc. When undertaking research, you’ll often want to measure these types of constructs and variables and, at the simplest level, validity and reliability are about ensuring the quality and accuracy of those measurements.

As you can probably imagine, if your measurements aren’t accurate or there are quality issues at play when you’re collecting your data, your entire study will be at risk. Therefore, validity and reliability are very important concepts to understand (and to get right). So, let’s unpack each of them.

What Is Validity?

In simple terms, validity (in the measurement sense discussed here, often called “construct validity”) is all about whether a research instrument accurately measures what it’s supposed to measure.

For example, let’s say you have a set of Likert scales that are supposed to quantify someone’s level of overall job satisfaction. If this set of scales focused purely on only one dimension of job satisfaction, say pay satisfaction, this would not be a valid measurement, as it only captures one aspect of the multidimensional construct. In other words, pay satisfaction alone is only one contributing factor toward overall job satisfaction, and therefore it’s not a valid way to measure someone’s job satisfaction.

Oftentimes in quantitative studies, the way in which the researcher or survey designer interprets a question or statement can differ from how the study participants interpret it. Given that respondents don’t have the opportunity to ask clarifying questions when taking a survey, it’s easy for these sorts of misunderstandings to crop up. Naturally, if the respondents are interpreting the question in the wrong way, the data they provide will be pretty useless. Therefore, ensuring that a study’s measurement instruments are valid – in other words, that they are measuring what they intend to measure – is incredibly important.

There are various types of validity and we’re not going to go down that rabbit hole in this post, but it’s worth quickly highlighting the importance of making sure that your research instrument is tightly aligned with the theoretical construct you’re trying to measure. In other words, you need to pay careful attention to how the key theories within your study define the thing you’re trying to measure – and then make sure that your survey presents it in the same way.

For example, sticking with the “job satisfaction” construct we looked at earlier, you’d need to clearly define what you mean by job satisfaction within your study (and this definition would of course need to be underpinned by the relevant theory). You’d then need to make sure that your chosen definition is reflected in the types of questions or scales you’re using in your survey. Simply put, you need to make sure that your survey respondents are perceiving your key constructs in the same way you are. Or, even if they’re not, that your measurement instrument is capturing the necessary information that reflects your definition of the construct at hand.


What Is Reliability?

As with validity, reliability is an attribute of a measurement instrument – for example, a survey, a weight scale or even a blood pressure monitor. But while validity is concerned with whether the instrument is measuring the “thing” it’s supposed to be measuring, reliability is concerned with consistency and stability. In other words, reliability reflects the degree to which a measurement instrument produces consistent results when applied repeatedly to the same phenomenon, under the same conditions.

As you can probably imagine, a measurement instrument that achieves a high level of consistency is naturally more dependable (or reliable) than one that doesn’t – in other words, it can be trusted to provide consistent measurements. And that, of course, is what you want when undertaking empirical research. If you think about it within a more domestic context, just imagine if you found that your bathroom scale gave you a different number every time you hopped on and off of it – you wouldn’t feel too confident in its ability to measure the variable that is your body weight 🙂

It’s worth mentioning that reliability also extends to the person using the measurement instrument. For example, if two researchers use the same instrument (let’s say a measuring tape) and they get different measurements, there’s likely an issue in terms of how one (or both) of them are using the measuring tape. So, when you think about reliability, consider both the instrument and the researcher as part of the equation.

As with validity, there are various types of reliability and various tests that can be used to assess the reliability of an instrument. A popular one that you’ll likely come across for survey instruments is Cronbach’s alpha, which is a statistical measure that quantifies the degree to which items within an instrument (for example, a set of Likert scales) measure the same underlying construct. In other words, Cronbach’s alpha indicates how closely related the items are and whether they consistently capture the same concept.
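To make Cronbach’s alpha a little more concrete, here is a minimal sketch of how it could be computed by hand, using the standard formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The function name `cronbach_alpha` and the five-respondent, four-item data set are made up purely for illustration; in practice you would normally rely on a statistics package.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Transpose so each tuple holds all respondents' scores for one item.
    items = list(zip(*responses))
    item_variance_sum = sum(variance(item) for item in items)
    # Variance of each respondent's total score across all items.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical 5-point Likert responses: 5 respondents x 4 items.
responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.3f}")
```

Values closer to 1 suggest the items move together. Keep in mind that a high alpha speaks to internal consistency only; it says nothing about whether the items measure the right construct.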


Recap: Key Takeaways

Alright, let’s quickly recap to cement your understanding of validity and reliability:

Validity is concerned with whether an instrument (e.g., a set of Likert scales) is measuring what it’s supposed to measure.
Reliability is concerned with whether that measurement is consistent and stable when measuring the same phenomenon under the same conditions.

In short, validity and reliability are both essential to ensuring that your data collection efforts deliver high-quality, accurate data that help you answer your research questions. So, be sure to always pay careful attention to the validity and reliability of your measurement instruments when collecting and analysing data. As the adage goes, “rubbish in, rubbish out” – make sure that your data inputs are rock-solid.


Reliability and Validity – Definitions, Types & Examples

Published by Alvin Nicolas on August 16th, 2021. Revised on October 26, 2023.

A researcher must assess the quality of the collected data before drawing any conclusions. Every research design needs to address reliability and validity, which together indicate the quality of the research.

What is Reliability?

Reliability refers to the consistency of the measurement. It shows how trustworthy the score of a test is. If the collected data show the same results after repeated testing (for example, on different occasions or with comparable sample groups), the information is reliable. Reliability is necessary for validity, but it does not guarantee it.

Example: If you weigh yourself on a weighing scale several times in a row, you’ll get the same result. These are considered reliable results obtained through repeated measures.

Example: A teacher gives students a math test and repeats it the next week with the same questions. If the students get the same scores, the reliability of the test is high.

What is Validity?

Validity refers to the accuracy of the measurement. Validity shows how suitable a specific test is for a particular situation. If the results are accurate according to the researcher’s situation, explanation, and prediction, then the research is valid.

If the method of measuring is accurate, then it’ll produce accurate results. A reliable method is not necessarily valid, because it can be consistently wrong. In contrast, if a method is not reliable, it cannot be valid.

Example: Your weighing scale shows different results each time you weigh yourself within a day even after handling it carefully, and weighing before and after meals. Your weighing machine might be malfunctioning. It means your method had low reliability. Hence you are getting inaccurate or inconsistent results that are not valid.

Example: Suppose a questionnaire is distributed among a group of people to check the quality of a skincare product, and the same questionnaire is then repeated with many groups. If you get the same responses from the various groups, the questionnaire has high reliability. That consistency is a precondition for validity, but it does not by itself show that the questionnaire measures product quality accurately.

Most of the time, validity is difficult to measure even though the process of measurement is reliable. It isn’t easy to interpret the real situation.

Example: If the weighing scale shows the same result, let’s say 70 kg each time, even if your actual weight is 55 kg, then it means the weighing scale is malfunctioning. It is showing consistent results, so it is reliable, but the results cannot be considered valid. It means the method has high reliability but low validity.

Internal Vs. External Validity

One of the key features of randomised designs is that they tend to have high internal validity. When participants are also randomly sampled from the population of interest, they can support external validity as well.

Internal validity is the ability to draw a causal link between your treatment and the dependent variable of interest. It means the observed changes should be due to the experiment conducted, and external factors should not influence the variables.

Examples of such external factors: age, level, height, and grade.

External validity is the ability to identify and generalise your study outcomes to the population at large. The relationship between the study’s situation and the situations outside the study is considered external validity.


Threats to Internal Validity

Threats to External Validity

How to Assess Reliability and Validity

Reliability can be measured by comparing the consistency of the procedure and its results. There are various methods to measure validity and reliability. Reliability can be measured through various statistical methods, depending on the type of reliability in question.

Types of Reliability

Types of Validity

As we discussed above, the reliability of the measurement alone cannot determine its validity. Validity is difficult to measure even if the method is reliable. Several types of tests are used to assess validity.


How to Increase Reliability?

Use an appropriate questionnaire to measure the competency level.
Ensure a consistent environment for participants.
Make the participants familiar with the criteria of assessment.
Train the participants appropriately.
Analyse the research items regularly to avoid poor performance.

How to Increase Validity?

Ensuring validity is not an easy job either. Some practical ways to improve validity are given below:

Minimising reactivity should be the first concern.
The Hawthorne effect should be reduced.
The respondents should be motivated.
The intervals between the pre-test and post-test should not be lengthy.
Dropout rates should be kept as low as possible.
The inter-rater reliability should be ensured.
Control and experimental groups should be matched with each other.

How to Implement Reliability and Validity in your Thesis?

According to experts, it is helpful to address reliability and validity explicitly in your work. These concepts are applied especially often in theses and dissertations.

Frequently Asked Questions

What is reliability and validity in research?

Reliability in research refers to the consistency and stability of measurements or findings. Validity relates to the accuracy and truthfulness of results, measuring what the study intends to. Both are crucial for trustworthy and credible research outcomes.

What is validity?

Validity in research refers to the extent to which a study accurately measures what it intends to measure. It ensures that the results are truly representative of the phenomena under investigation. Without validity, research findings may be irrelevant, misleading, or incorrect, limiting their applicability and credibility.

What is reliability?

Reliability in research refers to the consistency and stability of measurements over time. If a study is reliable, repeating the experiment or test under the same conditions should produce similar results. Without reliability, findings become unpredictable and lack dependability, potentially undermining the study’s credibility and generalisability.

What is reliability in psychology?

In psychology, reliability refers to the consistency of a measurement tool or test. A reliable psychological assessment produces stable and consistent results across different times, situations, or raters. It ensures that an instrument’s scores are not due to random error, making the findings dependable and reproducible in similar conditions.

What is test-retest reliability?

Test-retest reliability assesses the consistency of measurements taken by a test over time. It involves administering the same test to the same participants at two different points in time and comparing the results. A high correlation between the scores indicates that the test produces stable and consistent results over time.
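As a rough illustration of the comparison described above, the sketch below correlates two sets of scores from the same participants using the Pearson correlation coefficient, one common choice for this purpose. The helper name `pearson_r` and the scores are invented for the example, and what counts as a "high" correlation depends on the field and the instrument.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical scores for six participants, tested twice a few weeks apart.
time_1 = [12, 18, 25, 31, 22, 15]
time_2 = [13, 17, 27, 30, 21, 16]

r = pearson_r(time_1, time_2)
print(f"Test-retest correlation: {r:.2f}")
```

A correlation near 1 means participants kept roughly the same rank order across the two sessions, which is what a stable test should produce.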

How to improve reliability of an experiment?

Standardise procedures and instructions.
Use consistent and precise measurement tools.
Train observers or raters to reduce subjective judgments.
Increase sample size to reduce random errors.
Conduct pilot studies to refine methods.
Repeat measurements or use multiple methods.
Address potential sources of variability.

What is the difference between reliability and validity?

Reliability refers to the consistency and repeatability of measurements, ensuring results are stable over time. Validity indicates how well an instrument measures what it’s intended to measure, ensuring accuracy and relevance. While a test can be reliable without being valid, a valid test must inherently be reliable. Both are essential for credible research.

Are interviews reliable and valid?

Interviews can be both reliable and valid, but they are susceptible to biases. The reliability and validity depend on the design, structure, and execution of the interview. Structured interviews with standardised questions improve reliability. Validity is enhanced when questions accurately capture the intended construct and when interviewer biases are minimised.

Are IQ tests valid and reliable?

IQ tests are generally considered reliable, producing consistent scores over time. Their validity, however, is a subject of debate. While they effectively measure certain cognitive skills, whether they capture the entirety of “intelligence” or predict success in all life areas is contested. Cultural bias and over-reliance on tests are also concerns.

Are questionnaires reliable and valid?

Questionnaires can be both reliable and valid if well-designed. Reliability is achieved when they produce consistent results over time or across similar populations. Validity is ensured when questions accurately measure the intended construct. However, factors like poorly phrased questions, respondent bias, and lack of standardisation can compromise their reliability and validity.


Reliability vs Validity in Research | Differences, Types & Examples

Published on 3 May 2022 by Fiona Middleton. Revised on 10 October 2022.

Reliability and validity are concepts used to evaluate the quality of research. They indicate how well a method, technique, or test measures something. Reliability is about the consistency of a measure, and validity is about the accuracy of a measure.

It’s important to consider reliability and validity when you are creating your research design, planning your methods, and writing up your results, especially in quantitative research.

Understanding reliability vs validity
How are reliability and validity assessed?
How to ensure validity and reliability in your research
Where to write about reliability and validity in a thesis

Reliability and validity are closely related, but they mean different things. A measurement can be reliable without being valid. However, if a measurement is valid, it is usually also reliable.

What is reliability?

Reliability refers to how consistently a method measures something. If the same result can be consistently achieved by using the same methods under the same circumstances, the measurement is considered reliable.

What is validity?

Validity refers to how accurately a method measures what it is intended to measure. If research has high validity, that means it produces results that correspond to real properties, characteristics, and variations in the physical or social world.

High reliability is one indicator that a measurement is valid. If a method is not reliable, it probably isn’t valid.

However, reliability on its own is not enough to ensure validity. Even if a test is reliable, it may not accurately reflect the real situation.

Validity is harder to assess than reliability, but it is even more important. To obtain useful results, the methods you use to collect your data must be valid: the research must be measuring what it claims to measure. This ensures that your discussion of the data and the conclusions you draw are also valid.


Reliability can be estimated by comparing different versions of the same measurement. Validity is harder to assess, but it can be estimated by comparing the results to other relevant data or theory. Methods of estimating reliability and validity are usually split up into different types.

Types of reliability

Different types of reliability can be estimated through various statistical methods.

Types of validity

The validity of a measurement can be estimated based on three main types of evidence. Each type can be evaluated through expert judgement or statistical methods.

To assess the validity of a cause-and-effect relationship, you also need to consider internal validity (the design of the experiment) and external validity (the generalisability of the results).

The reliability and validity of your results depend on creating a strong research design, choosing appropriate methods and samples, and conducting the research carefully and consistently.

Ensuring validity

If you use scores or ratings to measure variations in something (such as psychological traits, levels of ability, or physical properties), it’s important that your results reflect the real variations as accurately as possible. Validity should be considered in the very earliest stages of your research, when you decide how you will collect your data.

Choose appropriate methods of measurement

Ensure that your method and measurement technique are of high quality and targeted to measure exactly what you want to know. They should be thoroughly researched and based on existing knowledge.

For example, to collect data on a personality trait, you could use a standardised questionnaire that is considered reliable and valid. If you develop your own questionnaire, it should be based on established theory or the findings of previous studies, and the questions should be carefully and precisely worded.

Use appropriate sampling methods to select your subjects

To produce valid generalisable results, clearly define the population you are researching (e.g., people from a specific age range, geographical location, or profession). Ensure that you have enough participants and that they are representative of the population.

Ensuring reliability

Reliability should be considered throughout the data collection process. When you use a tool or technique to collect data, it’s important that the results are precise, stable, and reproducible.

Apply your methods consistently

Plan your method carefully to make sure you carry out the same steps in the same way for each measurement. This is especially important if multiple researchers are involved.

For example, if you are conducting interviews or observations, clearly define how specific behaviours or responses will be counted, and make sure questions are phrased the same way each time.
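When several researchers code the same observations, one way to check that the counting rules are being applied consistently is an agreement statistic. The sketch below uses Cohen’s kappa, which is not named in this article but is a widely used measure for two raters and categorical codes. The function name, the behaviour codes "on-task" and "off-task", and the ratings are all hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Proportion of observations where both raters chose the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Agreement expected by chance, given how often each rater uses each code.
    expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single code")
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned by two observers to the same ten observations.
rater_a = ["on-task", "off-task", "on-task", "on-task", "off-task",
           "on-task", "off-task", "on-task", "on-task", "off-task"]
rater_b = ["on-task", "off-task", "on-task", "off-task", "off-task",
           "on-task", "off-task", "on-task", "on-task", "on-task"]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Here the two observers agree on 8 of 10 observations, but kappa comes out lower than 0.8 because some of that agreement would be expected by chance alone. A low value is a prompt to tighten the coding definitions or retrain the observers.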

Standardise the conditions of your research

When you collect your data, keep the circumstances as consistent as possible to reduce the influence of external factors that might create variation in the results.

For example, in an experimental setup, make sure all participants are given the same information and tested under the same conditions.

It’s appropriate to discuss reliability and validity in various sections of your thesis or dissertation or research paper. Showing that you have taken them into account in planning your research and interpreting the results makes your work more credible and trustworthy.

Cite this Scribbr article


Middleton, F. (2022, October 10). Reliability vs Validity in Research | Differences, Types & Examples. Scribbr. Retrieved 22 April 2024, from https://www.scribbr.co.uk/research-methods/reliability-or-validity/


Validity in research: a guide to measuring the right things

Last updated: 27 February 2023 | Reviewed by: Cathy Heath

Validity is necessary for all types of studies ranging from market validation of a business or product idea to the effectiveness of medical trials and procedures. So, how can you determine whether your research is valid? This guide can help you understand what validity is, the types of validity in research, and the factors that affect research validity.


What is validity?

In the most basic sense, validity is the quality of being based on truth or reason. Valid research strives to eliminate the effects of unrelated information and the circumstances under which evidence is collected.

Validity in research is the ability to conduct an accurate study with the right tools and conditions to yield acceptable and reliable data that can be reproduced. Researchers rely on carefully calibrated tools for precise measurements. However, collecting accurate information can be more of a challenge.

To achieve and maintain validity, studies must be conducted in environments that don't sway the results. Validity can also be compromised by asking the wrong questions or relying on limited data.

Why is validity important in research?

Research is used to improve life for humans. Every product and discovery, from innovative medical breakthroughs to advanced new products, depends on accurate research to be dependable. Without it, the results couldn't be trusted, and products would likely fail. Businesses would lose money, and patients couldn't rely on medical treatments.

While wasting money on a lousy product is a concern, a lack of validity paints a much grimmer picture in the medical field or in the production of automobiles and airplanes, for example. Whether you're launching an exciting new product or conducting scientific research, validity can determine success or failure.

What is reliability?

Reliability is the ability of a method to yield consistency. If the same result can be consistently achieved by using the same method to measure something, the measurement method is said to be reliable. For example, a thermometer that shows the same temperatures each time in a controlled environment is reliable.

While high reliability is a part of measuring validity, it's only part of the puzzle. If the reliable thermometer hasn't been properly calibrated and reliably measures temperatures two degrees too high, it doesn't provide a valid (accurate) measure of temperature.

Similarly, if a researcher uses a thermometer to measure weight, the results won't be accurate because it's the wrong tool for the job.
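The thermometer example can be put into numbers. In the sketch below, the spread of repeated readings stands in for reliability (a smaller spread means more consistent readings) and the average offset from the true value stands in for accuracy. The function name `describe_readings`, the true temperature of 20.0 degrees, and both sets of readings are invented for illustration.

```python
from statistics import mean, stdev


def describe_readings(readings, true_value):
    """Return (bias, spread): mean offset from the true value, and sample std dev."""
    bias = mean(readings) - true_value
    spread = stdev(readings)
    return bias, spread


true_temp = 20.0  # assumed known reference temperature

# Hypothetical thermometer A: very consistent, but reads about two degrees high.
consistent_but_off = [22.0, 22.1, 21.9, 22.0, 22.0]
# Hypothetical thermometer B: correct on average, but readings jump around.
noisy = [18.5, 21.7, 19.2, 22.4, 18.2]

for name, readings in [("A", consistent_but_off), ("B", noisy)]:
    bias, spread = describe_readings(readings, true_temp)
    print(f"Thermometer {name}: bias = {bias:+.2f}, spread = {spread:.2f}")
```

Thermometer A is reliable but not accurate, which is exactly the miscalibrated case described above. Thermometer B is right on average, but no single reading from it can be trusted.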

How are reliability and validity assessed?

While measuring reliability is a part of measuring validity, there are distinct ways to assess both measurements for accuracy.

How is reliability measured?

The following measures of consistency and stability help assess reliability:

Consistency and stability of the same measure when repeated multiple times and across conditions

Consistency and stability of the measure across different test subjects

Consistency and stability of results from different parts of a test designed to measure the same thing

How is validity measured?

Since validity refers to how accurately a method measures what it is intended to measure, it can be difficult to assess. Validity can be estimated by comparing research results to other relevant data or theories, looking at:

The adherence of a measure to existing knowledge of how the concept is measured

The ability to cover all aspects of the concept being measured

The relation of the result in comparison with other valid measures of the same concept

What are the types of validity in a research design?

Research validity is broadly gathered into two groups: internal and external. Yet, this grouping doesn't clearly define the different types of validity. Research validity can be divided into seven distinct groups.

Face validity: A test that appears valid simply because of the appropriateness or relevance of the testing method, included information, or tools used.

Content validity: The determination that the measure used in research covers the full domain of the content.

Construct validity: The assessment of the suitability of the measurement tool to measure the activity being studied.

Internal validity: The assessment of how your research environment affects measurement results. This is where other factors can’t explain the extent of an observed cause-and-effect response.

External validity: The extent to which the study will be accurate beyond the sample and the level to which it can be generalized in other settings, populations, and measures.

Statistical conclusion validity: The determination of whether a relationship exists between procedures and outcomes (appropriate sampling and measuring procedures along with appropriate statistical tests).

Criterion-related validity: A measurement of the quality of your testing methods against a criterion measure (like a “gold standard” test) that is measured at the same time.

Examples of validity

Like different types of research and the various ways to measure validity, examples of validity can vary widely. These include:

A questionnaire may be considered valid because each question addresses specific and relevant aspects of the study subject.

In a brand assessment study, researchers can use comparison testing to verify the results of an initial study. For example, the results from a focus group response about brand perception are considered more valid when the results match that of a questionnaire answered by current and potential customers.

A test to measure a class of students' understanding of the English language contains reading, writing, listening, and speaking components to cover the full scope of how language is used.

Factors that affect research validity

Certain factors can affect research validity in both positive and negative ways. By understanding the factors that improve validity and those that threaten it, you can enhance the validity of your study. These include:

Random selection of participants vs. the selection of participants that are representative of your study criteria

Blinding with interventions the participants are unaware of (like the use of placebos)

Manipulating the experiment by inserting a variable that will change the results

Randomly assigning participants to treatment and control groups to avoid bias

Following specific procedures during the study to avoid unintended effects

Conducting a study in the field instead of a laboratory for more accurate results

Replicating the study with different factors or settings to compare results

Using statistical methods to adjust for inconclusive data
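Random assignment, one of the factors listed above, is simple to carry out in code. The sketch below shuffles a participant list and splits it into two groups. The function name `assign_groups`, the participant IDs, and the seed are placeholders; a fixed seed is used here only so that the assignment can be reproduced and audited.

```python
import random


def assign_groups(participants, seed=None):
    """Randomly split participants into (treatment, control) groups."""
    rng = random.Random(seed)      # separate generator, so global state is untouched
    shuffled = list(participants)  # copy, so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical participant IDs P01 to P20.
participants = [f"P{i:02d}" for i in range(1, 21)]

treatment, control = assign_groups(participants, seed=42)
print("Treatment:", sorted(treatment))
print("Control:  ", sorted(control))
```

Because chance alone decides who lands in which group, neither the researcher's preferences nor the participants' characteristics can systematically favour one group over the other.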

What are the common validity threats in research, and how can their effects be minimized or nullified?

Research validity can be difficult to achieve because of internal and external threats that produce inaccurate results. These factors can jeopardize validity.

History: Events that occur between an early and later measurement

Maturation: The passage of time in a study can include data on actions that would have naturally occurred outside of the settings of the study

Repeated testing: Taking a test repeatedly can change the outcome of subsequent tests

Selection of subjects: Unconscious bias in choosing subjects, which can result in comparison groups that are not equivalent

Statistical regression: Choosing subjects based on extremes doesn't yield an accurate outcome for the majority of individuals

Attrition: When the sample group is diminished significantly during the course of the study

Maturation: When subjects mature during the study, and the effects of natural maturation are wrongly attributed to the study

Some validity threats can be minimized or wholly nullified. For example, random selection can reduce unconscious bias and the effects of statistical regression. Removing all threats from a study, however, is impossible.

Researchers can even hope to avoid attrition by using smaller study groups. Yet, smaller study groups could potentially affect the research in other ways. The best practice for researchers to prevent validity threats is careful environmental planning combined with reliable data-gathering methods.

How to ensure validity in your research

Researchers should be mindful of the importance of validity in the early planning stages of any study to avoid inaccurate results. Researchers must take the time to consider tools and methods as well as how the testing environment matches closely with the natural environment in which results will be used.

The following steps can be used to ensure validity in research:

Choose appropriate methods of measurement

Use appropriate sampling to choose test subjects

Create an accurate testing environment

How do you maintain validity in research?

Accurate research is usually conducted over a period of time with different test subjects. To maintain validity across an entire study, you must take specific steps to ensure that gathered data has the same levels of accuracy.

Consistency is crucial for maintaining validity in research. When researchers apply methods consistently and standardize the circumstances under which data is collected, validity can be maintained across the entire study.

Is there a need for validation of the research instrument before its implementation?

An essential part of validity is choosing the right research instrument or method for accurate results. Consider a thermometer that is reliable but still produces inaccurate results. You're unlikely to achieve research validity without activities like calibration and checks of content and construct validity.

Understanding research validity for more accurate results

Without validity, research can't provide the accuracy necessary to deliver a useful study. By getting a clear understanding of validity in research, you can take steps to improve your research skills and achieve more accurate results.

Understanding Reliability and Validity

These related research issues ask us to consider whether we are studying what we think we are studying and whether the measures we use are consistent.

Reliability

Reliability is the extent to which an experiment, test, or any measuring procedure yields the same result on repeated trials. Without the agreement of independent observers able to replicate research procedures, or the ability to use research tools and procedures that yield consistent measurements, researchers would be unable to satisfactorily draw conclusions, formulate theories, or make claims about the generalizability of their research. In addition to its important role in research, reliability is critical for many parts of our lives, including manufacturing, medicine, and sports.

Reliability is such an important concept that it has been defined in terms of its application to a wide range of activities. For researchers, four key types of reliability are:

Equivalency Reliability

Equivalency reliability is the extent to which two items measure identical concepts at an identical level of difficulty. Equivalency reliability is determined by relating two sets of test scores to one another to highlight the degree of relationship or association. In quantitative studies and particularly in experimental studies, a correlation coefficient, statistically referred to as r, is used to show the strength of the correlation between a dependent variable (the subject under study) and one or more independent variables, which are manipulated to determine effects on the dependent variable. An important consideration is that equivalency reliability is concerned with correlational, not causal, relationships.

For example, a researcher studying university English students happened to notice that when some students were studying for finals, their holiday shopping began. Intrigued by this, the researcher attempted to observe how often, or to what degree, these two behaviors co-occurred throughout the academic year. The researcher used the results of the observations to assess the correlation between studying throughout the academic year and shopping for gifts. The researcher concluded there was poor equivalency reliability between the two actions. In other words, studying was not a reliable predictor of shopping for gifts.
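The correlation coefficient r mentioned above can be computed directly from two sets of scores. The sketch below is a minimal Pearson correlation in plain Python; the function name `pearson_r` and the two score lists are hypothetical and stand in for any pair of equivalent test forms.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Assumes both lists vary; constant scores would make the denominator zero
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for the same six students on two forms of a test
form_a = [72, 85, 90, 64, 78, 88]
form_b = [70, 88, 93, 60, 75, 91]
r = pearson_r(form_a, form_b)
print(round(r, 3))
```

A value of r close to 1 suggests the two forms rank students in much the same way; a value near 0, as in the studying and shopping example, suggests little relationship. Either way, r describes association only, not cause.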

Stability Reliability

Stability reliability (sometimes called test-retest reliability) is the agreement of measuring instruments over time. To determine stability, a measure or test is repeated on the same subjects at a future date. Results are compared and correlated with the initial test to give a measure of stability.

An example of stability reliability would be the method of maintaining weights used by the U.S. Bureau of Standards. Platinum objects of fixed weight (one kilogram, one pound, etc.) are kept locked away. Once a year they are taken out and weighed, allowing scales to be reset so they are "weighing" accurately. Keeping track of how much the scales are off from year to year establishes a stability reliability for these instruments. In this instance, the platinum weights themselves are assumed to have a perfectly fixed stability reliability.
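The year-to-year bookkeeping described in this example might look something like the sketch below. The reference mass, the years, and the readings are all invented for illustration; the point is only that each reading is compared against a standard assumed to be fixed.

```python
REFERENCE_GRAMS = 1000.0  # assumed true mass of the reference weight

# Hypothetical yearly readings (grams) from one scale, taken before it is reset
yearly_readings = {2019: 1000.4, 2020: 999.8, 2021: 1000.1, 2022: 1000.6}

# How far the scale was off from the reference in each year
offsets = {year: round(reading - REFERENCE_GRAMS, 3)
           for year, reading in yearly_readings.items()}

# Largest deviation in either direction across all years
largest_drift = max(abs(offset) for offset in offsets.values())

print(offsets)
print(largest_drift)
```

Small, steady offsets would indicate a stable instrument; large or erratic ones would indicate that the scale cannot be trusted between calibrations.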

Internal Consistency

Internal consistency is the extent to which tests or procedures assess the same characteristic, skill or quality. It is a measure of the precision between the observers or of the measuring instruments used in a study. This type of reliability often helps researchers interpret data and predict the value of scores and the limits of the relationship among variables.

For example, a researcher designs a questionnaire to find out about college students' dissatisfaction with a particular textbook. Analyzing the internal consistency of the survey items dealing with dissatisfaction will reveal the extent to which items on the questionnaire focus on the notion of dissatisfaction.
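One widely used statistic for internal consistency is Cronbach's alpha. The text above does not name a specific statistic, so its use here is an assumption, as are the function name `cronbach_alpha` and the made-up questionnaire responses. The sketch compares the variance of individual items with the variance of respondents' total scores.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Cronbach's alpha; `responses` holds one list of item scores per respondent."""
    k = len(responses[0])                       # number of questionnaire items
    items = list(zip(*responses))               # transpose: one tuple of scores per item
    item_variances = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 1-5 dissatisfaction ratings: 5 respondents x 4 items
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Values closer to 1 suggest that the items move together and are plausibly tapping the same notion; what counts as "high enough" depends on the field and the purpose of the instrument.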

Interrater Reliability

Interrater reliability is the extent to which two or more individuals (coders or raters) agree. Interrater reliability addresses the consistency of the implementation of a rating system.

A test of interrater reliability would be the following scenario: Two or more researchers are observing a high school classroom. The class is discussing a movie that they have just viewed as a group. The researchers have a sliding rating scale (1 being most positive, 5 being most negative) with which they are rating the students' oral responses. Interrater reliability assesses the consistency of how the rating system is implemented. For example, if one researcher gives a "1" to a student response, while another researcher gives a "5," the interrater reliability would obviously be low. Interrater reliability is dependent upon the ability of two or more individuals to be consistent. Training, education and monitoring skills can enhance interrater reliability.
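Agreement between two raters in a scenario like this can be quantified. The sketch below computes simple percent agreement and Cohen's kappa, which corrects for agreement expected by chance. Neither statistic is named in the text above, so both are assumptions, and the ratings are invented.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Share of responses on which both raters gave the same rating."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters (unweighted)."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions, per category
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 ratings of eight student responses by two researchers
rater_a = [1, 2, 2, 3, 5, 4, 1, 2]
rater_b = [1, 2, 3, 3, 5, 4, 2, 2]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(agreement, round(kappa, 3))
```

Unweighted kappa treats a 1-versus-5 disagreement the same as a 1-versus-2 disagreement. For an ordered scale like this one, a weighted variant may be the better choice.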

Related Information: Reliability Example

An example of the importance of reliability is the use of measuring devices in Olympic track and field events. For the vast majority of people, ordinary measuring rulers and their degree of accuracy are reliable enough. However, for an Olympic event, such as the discus throw, the slightest variation in a measuring device -- whether it is a tape, clock, or other device -- could mean the difference between the gold and silver medals. Additionally, it could mean the difference between a new world record and outright failure to qualify for an event. Olympic measuring devices, then, must be reliable from one throw or race to another and from one competition to another. They must also be reliable when used in different parts of the world, as temperature, air pressure, humidity, interpretation, or other variables might affect their readings.

Validity

Validity refers to the degree to which a study accurately reflects or assesses the specific concept that the researcher is attempting to measure. While reliability is concerned with the accuracy of the actual measuring instrument or procedure, validity is concerned with the study's success at measuring what the researchers set out to measure.

Researchers should be concerned with both external and internal validity. External validity refers to the extent to which the results of a study are generalizable or transferable. (Most discussions of external validity focus solely on generalizability; see Campbell and Stanley, 1966. We include a reference here to transferability because many qualitative research studies are not designed to be generalized.)

Internal validity refers to (1) the rigor with which the study was conducted (e.g., the study's design, the care taken to conduct measurements, and decisions concerning what was and wasn't measured) and (2) the extent to which the designers of a study have taken into account alternative explanations for any causal relationships they explore (Huitt, 1998). In studies that do not explore causal relationships, only the first of these definitions should be considered when assessing internal validity.

Scholars discuss several types of internal validity. Brief discussions of each follow.

Face Validity

Face validity is concerned with how a measure or procedure appears. Does it seem like a reasonable way to gain the information the researchers are attempting to obtain? Does it seem well designed? Does it seem as though it will work reliably? Unlike content validity, face validity does not depend on established theories for support (Fink, 1995).

Criterion Related Validity

Criterion related validity, also referred to as instrumental validity, is used to demonstrate the accuracy of a measure or procedure by comparing it with another measure or procedure which has been demonstrated to be valid.

For example, imagine a hands-on driving test has been shown to be an accurate test of driving skills. By comparing the scores on the written driving test with the scores from the hands-on driving test, the written test can be validated by using a criterion related strategy in which the hands-on driving test is compared to the written test.

Construct Validity

Construct validity seeks agreement between a theoretical concept and a specific measuring device or procedure. For example, a researcher inventing a new IQ test might spend a great deal of time attempting to "define" intelligence in order to reach an acceptable level of construct validity.

Construct validity can be broken down into two sub-categories: convergent validity and discriminant validity. Convergent validity is the actual general agreement among ratings, gathered independently of one another, where measures should be theoretically related. Discriminant validity is the lack of a relationship among measures which theoretically should not be related.

To understand whether a piece of research has construct validity, three steps should be followed. First, the theoretical relationships must be specified. Second, the empirical relationships between the measures of the concepts must be examined. Third, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested (Carmines & Zeller, p. 23).

Content Validity

Content validity is based on the extent to which a measurement reflects the specific intended domain of content (Carmines & Zeller, 1991, p. 20).

Content validity can be illustrated with the following examples: Researchers aim to study mathematical learning and create a survey to test for mathematical skill. If these researchers only tested for multiplication and then drew conclusions from that survey, their study would not show content validity because it excludes other mathematical functions. Although the establishment of content validity for placement-type exams seems relatively straightforward, the process becomes more complex as it moves into the more abstract domain of socio-cultural studies. For example, a researcher needing to measure an attitude like self-esteem must decide what constitutes a relevant domain of content for that attitude. For socio-cultural studies, content validity forces the researchers to define the very domains they are attempting to study.

Related Information: Validity Example

Many recreational activities of high school students involve driving cars. A researcher, wanting to measure whether recreational activities have a negative effect on grade point average in high school students, might conduct a survey asking how many students drive to school and then attempt to find a correlation between these two factors. Because many students might use their cars for purposes other than or in addition to recreation (e.g., driving to work after school, driving to school rather than walking or taking a bus), this research study might prove invalid. Even if a strong correlation was found between driving and grade point average, driving to school in and of itself would seem to be an invalid measure of recreational activity.

The challenges of achieving reliability and validity are among the most difficult faced by researchers. In this section, we offer commentaries on these challenges.

Difficulties of Achieving Reliability

It is important to understand some of the problems concerning reliability which might arise. It would be ideal to reliably measure, every time, exactly those things which we intend to measure. However, researchers can go to great lengths and make every attempt to ensure accuracy in their studies, and still deal with the inherent difficulties of measuring particular events or behaviors. Sometimes, and particularly in studies of natural settings, the only measuring device available is the researcher's own observations of human interaction or human reaction to varying stimuli. As these methods are ultimately subjective in nature, results may be unreliable and multiple interpretations are possible. Three of these inherent difficulties are quixotic reliability, diachronic reliability and synchronic reliability.

Quixotic reliability refers to the situation where a single manner of observation consistently, yet erroneously, yields the same result. It is often a problem when research appears to be going well. This consistency might seem to suggest that the experiment was demonstrating perfect stability reliability. This, however, would not be the case.

For example, if a measuring device used in an Olympic competition always read 100 meters for every discus throw, this would be an example of an instrument consistently, yet erroneously, yielding the same result. However, quixotic reliability is often more subtle in its occurrences than this. For example, suppose a group of German researchers doing an ethnographic study of American attitudes ask questions and record responses. Parts of their study might produce responses which seem reliable, yet turn out to measure felicitous verbal embellishments required for "correct" social behavior. Asking Americans, "How are you?" for example, would in most cases, elicit the token, "Fine, thanks." However, this response would not accurately represent the mental or physical state of the respondents.

Diachronic reliability refers to the stability of observations over time. It is similar to stability reliability in that it deals with time. While this type of reliability is appropriate to assess features that remain relatively unchanged over time, such as landscape benchmarks or buildings, the same level of reliability is more difficult to achieve with socio-cultural phenomena.

For example, in a follow-up study one year later of reading comprehension in a specific group of school children, diachronic reliability would be hard to achieve. If the test were given to the same subjects a year later, many confounding variables would have impacted the researchers' ability to reproduce the same circumstances present at the first test. The final results would almost assuredly not reflect the degree of stability sought by the researchers.

Synchronic reliability refers to the similarity of observations within the same time frame; it is not about the similarity of things observed. Synchronic reliability, unlike diachronic reliability, rarely involves observations of identical things. Rather, it concerns itself with particularities of interest to the research.

For example, a researcher studies the actions of a duck's wing in flight and the actions of a hummingbird's wing in flight. Despite the fact that the researcher is studying two distinctly different kinds of wings, the action of the wings and the phenomenon produced is the same.

Comments on a Flawed, Yet Influential Study

An example of the dangers of generalizing from research that is inconsistent, invalid, unreliable, and incomplete is found in the Time magazine article, "On A Screen Near You: Cyberporn" (De Witt, 1995). This article relies on a study done at Carnegie Mellon University to determine the extent and implications of online pornography. Inherent to the study are methodological problems of unqualified hypotheses and conclusions, unsupported generalizations and a lack of peer review.

Ignoring the functional problems that manifest themselves later in the study, it seems that there are a number of ethical problems within the article. The article claims to be an exhaustive study of pornography on the Internet, but it was anything but exhaustive; it resembles a case study more than anything else. Marty Rimm, author of the undergraduate paper that Time used as a basis for the article, claims the paper was an "exhaustive study" of online pornography when, in fact, the study based most of its conclusions about pornography on the Internet on the "descriptions of slightly more than 4,000 images" (Meeks, 1995, p. 1). Some USENET groups see hundreds of postings in a day.

Considering the thousands of USENET groups, 4,000 images no longer carries the authoritative weight that its author intended. The real problem is that the study (an undergraduate paper similar to a second-semester composition assignment) was based not on pornographic images themselves, but on the descriptions of those images. This kind of reduction detracts significantly from the integrity of the final claims made by the author. In fact, this kind of research is commensurate with doing a study of the content of pornographic movies based on the titles of the movies, then making sociological generalizations based on what those titles indicate. (This is obviously a problem with a number of types of validity, because Rimm is not studying what he thinks he is studying, but instead something quite different.)

The author of the Time article, Philip Elmer De Witt, writes, "The research team at CMU has undertaken the first systematic study of pornography on the Information Superhighway" (Godwin, 1995, p. 1). His statement is problematic in at least three ways. First, the research team actually consisted of a few of Rimm's undergraduate friends with no methodological training whatsoever. Additionally, no mention of the degree of interrater reliability is made. Second, this systematic study is actually merely a "non-randomly selected subset of commercial bulletin-board systems that focus on selling porn" (Godwin, p. 6). As pornography vending is actually just a small part of the whole concerning the use of pornography on the Internet, the entire premise of this study's content validity is firmly called into question. Finally, the use of the term "Information Superhighway" is a false assessment of what in actuality is only a few USENET groups and BBSs (Bulletin Board Systems), which make up only a small fraction of the entire "Information Superhighway" traffic. Essentially, what is here is yet another violation of content validity.

De Witt is quoted as saying: "In an 18-month study, the team surveyed 917,410 sexually-explicit pictures, descriptions, short-stories and film clips. On those USENET newsgroups where digitized images are stored, 83.5 percent of the pictures were pornographic" (De Witt, p. 40).

Statistically, some interesting contradictions arise. The figure 917,410 was taken from adult-oriented BBSs--none came from actual USENET groups or the Internet itself. This is a glaring discrepancy. Out of the 917,410 files, 212,114 are only descriptions (Hoffman & Novak, 1995, p. 2). The question is, how many actual images did the "researchers" see?

"Between April and July 1994, the research team downloaded all available images (3,254)...the team encountered technical difficulties with 13 percent of these images...This left a total of 2,830 images for analysis" (p. 2). This means that out of 917,410 files discussed in this study, 914,580 of them were not even pictures! As for the 83.5 percent figure, this is actually based on "17 alt.binaries groups that Rimm considered pornographic" (p. 2).

In real terms, 17 USENET groups is a fraction of a percent of all USENET groups available. Worse yet, Time claimed that "...only about 3 percent of all messages on the USENET [represent pornographic material], while the USENET itself represents 11.5 percent of the traffic on the Internet" (De Witt, p. 40).

Time neglected to carry the interpretation of this data out to its logical conclusion, which is that less than half of 1 percent (3 percent of 11.5 percent, or roughly 0.35 percent) of the images on the Internet are associated with newsgroups that contain pornographic imagery. Furthermore, of this half percent, an unknown but even smaller percentage of the messages in newsgroups that are 'associated with pornographic imagery' actually contained pornographic material (Hoffman & Novak, p. 3).
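The arithmetic behind these objections is easy to check. The sketch below uses only the figures quoted in the preceding paragraphs (917,410 files, 3,254 downloaded images, 2,830 analyzed images, 3 percent of USENET messages, and USENET at 11.5 percent of Internet traffic); the variable names are our own.

```python
# Figures as quoted from the study and the Time article
total_files = 917_410          # files the study claims to have surveyed
downloaded_images = 3_254      # images the team actually downloaded
analyzed_images = 2_830        # images left after technical difficulties

files_not_analyzed_as_images = total_files - analyzed_images
failed_share = (downloaded_images - analyzed_images) / downloaded_images
analyzed_share = analyzed_images / total_files

usenet_share_of_traffic = 0.115    # USENET as a share of Internet traffic
flagged_share_of_usenet = 0.03     # share of USENET messages in the flagged newsgroups
combined_share = usenet_share_of_traffic * flagged_share_of_usenet

print(files_not_analyzed_as_images)
print(f"{failed_share:.1%} of downloaded images lost to technical difficulties")
print(f"{analyzed_share:.2%} of all files were images that were actually analyzed")
print(f"{combined_share:.3%} of Internet traffic")
```

Seen this way, the images actually examined are a tiny fraction of the headline file count, and the combined share of traffic is well under half of 1 percent.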

Another blunder can be seen in the avoidance of peer-review, which suggests that there were some political interests being served in having the study become a Time cover story. Marty Rimm contracted the Georgetown Law Review and Time in an agreement to publish his study as long as they kept it under lock and key. During the months before publication, many interested scholars and professionals tried in vain to obtain a copy of the study in order to check it for flaws. De Witt justified not letting such peer-review take place, and also justified the reliability and validity of the study, on the grounds that because the Georgetown Law Review had accepted it, it was therefore reliable and valid, and needed no peer-review. What he didn't know was that law reviews are not edited by professionals, but by "third year law students" (Godwin, p. 4).

There are many consequences of the failure to subject such a study to the scrutiny of peer review. If it was Rimm's desire to publish an article about on-line pornography in a manner that legitimized his article, yet escaped the kind of critical review the piece would have to undergo if published in a scholarly journal of computer science, engineering, marketing, psychology, or communications, what better venue than a law journal? A law journal article would have the added advantage of being taken seriously by law professors, lawyers, and legally-trained policymakers. By virtue of where it appeared, it would automatically be catapulted into the center of the policy debate surrounding online censorship and freedom of speech (Godwin).

Herein lies the dangerous implication of such a study: Because the questions surrounding pornography are of such immediate political concern, the study was placed in the forefront of the U.S. domestic policy debate over censorship on the Internet (an integral aspect of current anti-First Amendment legislation), with little regard for its validity or reliability.

On June 26, the day the article came out, Senator Grassley (co-sponsor of the anti-porn bill, along with Senator Dole) began drafting a speech that was to be delivered that very day in the Senate, using the study as evidence. The same day, at the same time, Mike Godwin posted on WELL (Whole Earth 'Lectronic Link, a forum for professionals on the Internet) what turned out to be the overstatement of the year: "Philip's story is an utter disaster, and it will damage the debate about this issue because we will have to spend lots of time correcting misunderstandings that are directly attributable to the story" (Meeks, p. 7).

As Godwin was writing this, Senator Grassley was speaking to the Senate: "Mr. President, I want to repeat that: 83.5 percent of the 900,000 images reviewed--these are all on the Internet--are pornographic, according to the Carnegie-Mellon study" (p. 7). Several days later, Senator Dole was waving the magazine in front of the Senate like a battle flag.

Donna Hoffman, professor at Vanderbilt University, summed up the dangerous political implications by saying, "The critically important national debate over First Amendment rights and restrictions of information on the Internet and other emerging media requires facts and informed opinion, not hysteria" (p. 1).

In addition to the hysteria, Hoffman sees a plethora of other problems with the study. "Because the content analysis and classification scheme are 'black boxes,'" Hoffman said, "because no reliability and validity results are presented, because no statistical testing of the differences both within and among categories for different types of listings has been performed, and because not a single hypothesis has been tested, formally or otherwise, no conclusions should be drawn until the issues raised in this critique are resolved" (p. 4).

However, the damage has already been done. This questionable research by an undergraduate engineering major has been generalized to such an extent that even the U.S. Senate, and in particular Senators Grassley and Dole, have been duped, albeit through the strength of their own desires to see only what they wanted to see.

Annotated Bibliography

American Psychological Association. (1985). Standards for educational and psychological testing. Washington, DC: Author.

This work focuses on reliability, validity and the standards that testers need to achieve in order to ensure accuracy.

Babbie, E.R. & Huitt, R.E. (1979). The practice of social research, 2nd ed. Belmont, CA: Wadsworth Publishing.

An overview of social research and its applications.

Beauchamp, T. L., Faden, R.R., Wallace, Jr., R.J. & Walters, L. (1982). Ethical issues in social science research. Baltimore and London: The Johns Hopkins University Press.

A systematic overview of ethical issues in Social Science Research written by researchers with firsthand familiarity with the situations and problems researchers face in their work. This book raises several questions of how reliability and validity can be affected by ethics.

Borman, K.M. et al. (1986). Ethnographic and qualitative research design and why it doesn't work. American Behavioral Scientist, 30, 42-57.

The authors pose questions concerning threats to qualitative research and suggest solutions.

Bowen, K. A. (1996, Oct. 12). The sin of omission -punishable by death to internal validity: An argument for integration of quantitative research methods to strengthen internal validity. Available: http://trochim.human.cornell.edu/gallery/bowen/hss691.htm

An entire Web site that examines the merits of integrating qualitative and quantitative research methodologies through triangulation. The author argues that improving the internal validity of social science will be the result of such a union.

Brinberg, D. & McGrath, J.E. (1985). Validity and the research process. Beverly Hills: Sage Publications.

The authors investigate validity as value and propose the Validity Network Schema, a process by which researchers can infuse validity into their research.

Bussières, J-F. (1996, Oct. 12). Reliability and validity of information provided by museum Web sites. Available: http://www.oise.on.ca/~jfbussieres/issue.html

This Web page examines the validity of museum Web sites which calls into question the validity of Web-based resources in general. Addresses the issue that all Websites should be examined with skepticism about the validity of the information contained within them.

Campbell, D. T. & Stanley, J.C. (1963). Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin.

An overview of experimental research that includes pre-experimental designs, controls for internal validity, and tables listing sources of invalidity in quasi-experimental designs. Reference list and examples.

Carmines, E. G. & Zeller, R.A. (1991). Reliability and validity assessment. Newbury Park: Sage Publications.

An introduction to research methodology that includes classical test theory, validity, and methods of assessing reliability.

Carroll, K. M. (1995, Sep.). Methodological issues and problems in the assessment of substance use. Psychological Assessment, 7(3), 349-58.

Discusses methodological issues in research involving the assessment of substance abuse. Introduces strategies for avoiding problems with the reliability and validity of methods.

Connelly, F. M. & Clandinin, D.J. (1990). Stories of experience and narrative inquiry. Educational Researcher, 19(5), 2-12.

A survey of narrative inquiry that outlines criteria, methods, and writing forms. It includes a discussion of risks and dangers in narrative studies, as well as a research agenda for curricula and classroom studies.

De Witt, P.E. (1995, July 3). On a screen near you: Cyberporn. Time, 38-45.

A Time cover story reporting on a Carnegie Mellon study of online pornography by Marty Rimm, an electrical engineering student.

Fink, A., ed. (1995). The survey handbook, v. 1. Thousand Oaks, CA: Sage.

A guide to surveys; this is the first in a series referred to as the "survey kit". It includes bibliographical references. Addresses survey design, analysis, reporting surveys and how to measure the validity and reliability of surveys.

Fink, A., ed. (1995). How to measure survey reliability and validity, v. 7. Thousand Oaks, CA: Sage.

This volume seeks to select and apply reliability criteria and select and apply validity criteria. The fundamental principles of scaling and scoring are considered.

Godwin, M. (1995, July). JournoPorn, dissection of the Time article. Available: http://www.hotwired.com

A detailed critique of Time magazine's Cyberporn, outlining flaws of methodology as well as exploring the underlying assumptions of the article.

Hambleton, R.K. & Zaal, J.N., eds. (1991). Advances in educational and psychological testing. Boston: Kluwer Academic.

Information on the concepts of reliability and validity in psychology and education.

Harnish, D.L. (1992). Human judgment and the logic of evidence: A critical examination of research methods in special education transition literature. In D.L. Harnish et al. eds., Selected readings in transition.

This article investigates threats to validity in special education research.

Haynes, N. M. (1995). How skewed is 'the bell curve'? Book Product Reviews, 1-24.

This paper claims that R.J. Herrnstein and C. Murray's The Bell Curve: Intelligence and Class Structure in American Life does not have scientific merit and claims that the bell curve is an unreliable measure of intelligence.

Healey, J. F. (1993). Statistics: A tool for social research, 3rd ed. Belmont: Wadsworth Publishing.

Inferential statistics, measures of association, and multivariate techniques in statistical analysis for social scientists are addressed.

Helberg, C. (1996, Oct.12). Pitfalls of data analysis (or how to avoid lies and damned lies). Available: http//maddog/fammed.wisc.edu/pitfalls/

A discussion of things researchers often overlook in their data analysis and how statistics are often used to skew reliability and validity for the researchers' purposes.

Hoffman, D. L. and Novak, T.P. (1995, July). A detailed critique of the Time article: Cyberporn. Available: http://www.hotwired.com

A methodological critique of the Time article that uncovers some of the fundamental flaws in the statistics and the conclusions made by De Witt.

Huitt, William G. (1998). Internal and External Validity . http://www.valdosta.peachnet.edu/~whuitt/psy702/intro/valdgn.html

A Web document addressing key issues of external and internal validity.

Jones, J. E. & Bearley, W.L. (1996, Oct 12). Reliability and validity of training instruments. Organizational Universe Systems. Available: http://ous.usa.net/relval.htm

The authors discuss the reliability and validity of training design in a business setting. Basic terms are defined and examples provided.

Cultural Anthropology Methods Journal. (1996, Oct. 12). Available: http://www.lawrence.edu/~bradleyc/cam.html

An online journal containing articles on the practical application of research methods when conducting qualitative and quantitative research. Reliability and validity are addressed throughout.

Kirk, J. & Miller, M. M. (1986). Reliability and validity in qualitative research. Beverly Hills: Sage Publications.

This text describes objectivity in qualitative research by focusing on the issues of validity and reliability in terms of their limitations and applicability in the social and natural sciences.

Krakower, J. & Niwa, S. (1985). An assessment of validity and reliability of the institutional performance survey . Boulder, CO: National Center for Higher Education Management Systems.

Covers educational surveys, higher education research, and the effectiveness of organizations.

Lauer, J. M. & Asher, J.W. (1988). Composition Research. New York: Oxford University Press.

A discussion of empirical designs in the context of composition research as a whole.

Laurent, J. et al. (1992, Mar.) Review of validity research on the stanford-binet intelligence scale: 4th Ed. Psychological Assessment . 102-112.

This paper looks at the results of construct and criterion- related validity studies to determine if the SB:FE is a valid measure of intelligence.

LeCompte, M. D., Millroy, W.L., & Preissle, J. eds. (1992). The handbook of qualitative research in education. San Diego: Academic Press.

A compilation of the range of methodological and theoretical qualitative inquiry in the human sciences and education research. Numerous contributing authors apply their expertise to discussing a wide variety of issues pertaining to educational and humanities research as well as suggestions about how to deal with problems when conducting research.

McDowell, I. & Newell, C. (1987). Measuring health: A guide to rating scales and questionnaires . New York: Oxford University Press.

This gives a variety of examples of health measurement techniques and scales and discusses the validity and reliability of important health measures.

Meeks, B. (1995, July). Muckraker: How Time failed. Available: http://www.hotwired.com

A step-by-step outline of the events which took place during the researching, writing, and negotiating of the Time article of 3 July, 1995 titled: On A Screen Near You: Cyberporn .

Merriam, S. B. (1995). What can you tell from an N of 1?: Issues of validity and reliability in qualitative research. Journal of Lifelong Learning v4 , 51-60.

Addresses issues of validity and reliability in qualitative research for education. Discusses philosophical assumptions underlying the concepts of internal validity, reliability, and external validity or generalizability. Presents strategies for ensuring rigor and trustworthiness when conducting qualitative research.

Morris, L.L, Fitzgibbon, C.T., & Lindheim, E. (1987). How to measure performance and use tests. In J.L. Herman (Ed.), Program evaluation kit (2nd ed.). Newbury Park, CA: Sage.

Discussion of reliability and validity as they pertain to measuring students' performance.

Murray, S., et al. (1979, April). Technical issues as threats to internal validity of experimental and quasi-experimental designs. San Francisco: University of California. 8-12.

(From Yang et al. bibliography--unavailable as of this writing.)

Russ-Eft, D. F. (1980). Validity and reliability in survey research. American Institutes for Research in the Behavioral Sciences August , 227 151.

An investigation of validity and reliability in survey research with an overview of the concepts of reliability and validity. Specific procedures for measuring sources of error are suggested as well as general suggestions for improving the reliability and validity of survey data. An extensive annotated bibliography is provided.

Ryser, G. R. (1994). Developing reliable and valid authentic assessments for the classroom: Is it possible? Journal of Secondary Gifted Education Fall, v6 n1 , 62-66.

Defines the meanings of reliability and validity as they apply to standardized measures of classroom assessment. This article defines reliability as scorability and stability and validity is seen as students' ability to use knowledge authentically in the field.

Schmidt, W., et al. (1982). Validity as a variable: Can the same certification test be valid for all students? Institute for Research on Teaching July, ED 227 151.

A technical report that presents specific criteria for judging content, instructional and curricular validity as related to certification tests in education.

Scholfield, P. (1995). Quantifying language. A researcher's and teacher's guide to gathering language data and reducing it to figures . Bristol: Multilingual Matters.

A guide to categorizing, measuring, testing, and assessing aspects of language. A source for language-related practitioners and researchers in conjunction with other resources on research methods and statistics. Questions of reliability, and validity are also explored.

Scriven, M. (1993). Hard-Won Lessons in Program Evaluation . San Francisco: Jossey-Bass Publishers.

A common sense approach for evaluating the validity of various educational programs and how to address specific issues facing evaluators.

Shou, P. (1993, Jan.). The singer loomis inventory of personality: A review and critique. [Paper presented at the Annual Meeting of the Southwest Educational Research Association.]

Evidence for reliability and validity are reviewed. A summary evaluation suggests that SLIP (developed by two Jungian analysts to allow examination of personality from the perspective of Jung's typology) appears to be a useful tool for educators and counselors.

Sutton, L.R. (1992). Community college teacher evaluation instrument: A reliability and validity study . Diss. Colorado State University.

Studies of reliability and validity in occupational and educational research.

Thompson, B. & Daniel, L.G. (1996, Oct.). Seminal readings on reliability and validity: A "hit parade" bibliography. Educational and psychological measurement v. 56 , 741-745.

Editorial board members of Educational and Psychological Measurement generated bibliography of definitive publications of measurement research. Many articles are directly related to reliability and validity.

Thompson, E. Y., et al. (1995). Overview of qualitative research . Diss. Colorado State University.

A discussion of strengths and weaknesses of qualitative research and its evolution and adaptation. Appendices and annotated bibliography.

Traver, C. et al. (1995). Case Study . Diss. Colorado State University.

This presentation gives an overview of case study research, providing definitions and a brief history and explanation of how to design research.

Trochim, William M. K. (1996). External validity. Available: http://trochim.human.cornell.edu/kb/EXTERVAL.htm

A comprehensive treatment of external validity found in William Trochim's online text about research methods and issues.

Trochim, William M. K. (1996). Introduction to validity. Available: http://trochim.human.cornell.edu/kb/INTROVAL.htm

An introduction to validity found in William Trochim's online text about research methods and issues.

Trochim, William M. K. (1996). Reliability. Available: http://trochim.human.cornell.edu/kb/reltypes.htm

A comprehensive treatment of reliability found in William Trochim's online text about research methods and issues.

Validity. (1996, Oct. 12). Available: http://vislab-www.nps.navy.mil/~haga/validity.html

A source for definitions of various forms and types of reliability and validity.

Vinsonhaler, J. F., et al. (1983, July). Improving diagnostic reliability in reading through training. Institute for Research on Teaching ED 237 934.

This technical report investigates the practical application of a program intended to improve the diagnoses of reading deficient students. Here, reliability is assumed and a pragmatic answer to a specific educational problem is suggested as a result.

Wentland, E. J. & Smith, K.W. (1993). Survey responses: An evaluation of their validity . San Diego: Academic Press.

This book looks at the factors affecting response validity (or the accuracy of self-reports in surveys) and provides several examples with varying accuracy levels.

Wiget, A. (1996). Father juan greyrobe: Reconstructing tradition histories, and the reliability and validity of uncorroborated oral tradition. Ethnohistory 43:3 , 459-482.

This paper presents a convincing argument for the validity of oral histories in ethnographic research where at least some of the evidence can be corroborated through written records.

Yang, G. H., et al. (1995). Experimental and quasi-experimental educational research . Diss. Colorado State University.

This discussion defines experimentation and considers the rhetorical issues and advantages and disadvantages of experimental research. Annotated bibliography.

Yarroch, W. L. (1991, Sept.). The Implications of content versus validity on science tests. Journal of Research in Science Teaching , 619-629.

The use of content validity as the primary assurance of the measurement accuracy for science assessment examinations is questioned. An alternative accuracy measure, item validity, is proposed to look at qualitative comparisons between different factors.

Yin, R. K. (1989). Case study research: Design and methods . London: Sage Publications.

This book discusses the design process of case study research, including collection of evidence, composing the case study report, and designing single and multiple case studies.

Validity, Accuracy and Reliability Explained with Examples

This guide covers part of the Working Scientifically skills in the NSW HSC science curriculum.

Part 1 – Validity

Part 2 – Accuracy

Part 3 – Reliability

Science experiments are an essential part of high school education, helping students understand key concepts and develop critical thinking skills. However, the value of an experiment lies in its validity, accuracy, and reliability. Let's break down these terms and explore how they can be improved and reduced, using simple experiments as examples.

Target Analogy to Understand Accuracy and Reliability

The target analogy is a classic way to understand the concepts of accuracy and reliability in scientific measurements and experiments.

Accuracy refers to how close a measurement is to the true or accepted value. In the analogy, it's how close the arrows come to hitting the bullseye (which represents the true or accepted value).

Reliability refers to the consistency of a set of measurements. Reliable data can be reproduced under the same conditions. In the analogy, it's represented by how tightly the arrows are grouped together, regardless of whether they hit the bullseye. Therefore, we can have scientific results that are reliable but inaccurate.

Validity refers to how well an experiment investigates the aim or tests the underlying hypothesis. While validity is not represented in this target analogy, the validity of an experiment can sometimes be assessed by using the accuracy of results as a proxy. Experiments that produce accurate results are likely to be valid, as invalid experiments usually do not yield accurate results.

Validity refers to how well an experiment measures what it is supposed to measure and investigates the aim.

Ask yourself the questions:

"Is my experimental method and design suitable?"
"Is my experiment testing or investigating what it's supposed to?"

For example, if you're investigating the effect of the volume of water (independent variable) on plant growth, your experiment would be valid if you measure growth factors like height or leaf size (these would be your dependent variables).

However, validity entails more than just what's being measured. When assessing validity, you should also examine how well the experimental methodology investigates the aim of the experiment.

Assessing Validity

An experiment’s procedure, the subsequent methods of analysis of the data, the data itself, and the conclusion you draw from the data all have their own associated validities. It is important to understand this division because there are different factors to consider when assessing the validity of any single one of them. The validity of an experiment as a whole depends on the individual validities of these components.

When assessing the validity of the procedure , consider the following:

Does the procedure control all necessary variables except for the dependent and independent variables? That is, have you isolated the effect of the independent variable on the dependent variable?
Does this effect you have isolated actually address the aim and/or hypothesis?
Does your method include enough repetitions for a reliable result? (Read more about reliability below)

When assessing the validity of the method of analysis of the data , consider the following:

Does the analysis extrapolate or interpolate the experimental data? Generally, interpolation is valid, but extrapolation is invalid. This is because by extrapolating, you are ‘peering out into the darkness’ – just because your data showed a certain trend for a certain range, it does not mean that this trend will hold for all values.
Does the analysis use accepted laws and mathematical relationships? That is, do the equations used for analysis have scientific or mathematical base? For example, `F = ma` is an accepted law in physics, but if in the analysis you made up a relationship like `F = ma^2` that has no scientific or mathematical backing, the method of analysis is invalid.
Is the most appropriate method of analysis used? Consider the differences between using a table and a graph. In a graph, you can use the gradient to minimise the effects of systematic errors and can also reduce the effect of random errors. The visual nature of a graph also allows you to easily identify outliers and potentially exclude them from analysis. This is why graphical analysis is generally more valid than using values from tables.

When assessing the validity of your results , consider the following:

Is your primary data (data you collected from your own experiment) BOTH accurate and reliable? If not, it is invalid.
Are the secondary sources you may have used BOTH reliable and accurate?

When assessing the validity of your conclusion , consider the following:

Does your conclusion relate directly to the aim or the hypothesis?

How to Improve Validity

Ways of improving validity will differ across experiments. You must first identify what area(s) of the experiment’s validity is lacking (is it the procedure, analysis, results, or conclusion?). Then, you must come up with ways of overcoming the particular weakness.

Below are some examples of this.

Example – Validity in Chemistry Experiment

Let's say we want to measure the mass of carbon dioxide in a can of soft drink.

The following steps are followed:

Weigh an unopened can of soft drink on an electronic balance.
Open the can.
Place the can on a hot plate until it begins to boil.
When cool, re-weigh the can to determine the mass loss.

To ensure this experiment is valid, we must establish controlled variables:

type of soft drink used
temperature at which this experiment is conducted
period of time before soft drink is re-weighed

Despite these controlled variables, this experiment is invalid because it actually doesn't help us measure the mass of carbon dioxide in the soft drink. This is because by heating the soft drink until it boils, we are also losing water due to evaporation. As a result, the mass loss measured is not only due to the loss of carbon dioxide, but also water. A simple way to improve the validity of this experiment is to not heat it; by simply opening the can of soft drink, carbon dioxide in the can will escape without loss of water.

Example – Validity in Physics Experiment

Let's say we want to measure the value of gravitational acceleration `g` using a simple pendulum system, and the following equation:

$$T = 2\pi \sqrt{\frac{l}{g}}$$

`T` is the period of oscillation
`l` is the length of string attached to the mass
`g` is the acceleration due to gravity

Cut a piece of a string or dental floss so that it is 1.0 m long.
Attach a 500.0 g mass of high density to the end of the string.
Attach the other end of the string to the retort stand using a clamp.
Starting at an angle of less than 10°, allow the pendulum to swing and measure the pendulum’s period for 10 oscillations using a stopwatch.
Repeat the experiment with 1.2 m, 1.5 m and 1.8 m strings.

The controlled variables we must establish in this experiment include:

mass used in the pendulum
location at which the experiment is conducted

The validity of this experiment depends on the starting angle of oscillation. The above equation (method of analysis) is only true for small angles (`\theta < 15^{\circ}`) such that `\sin \theta \approx \theta`. We also want to make sure the pendulum system has a small enough surface area to minimise the effect of air resistance on its oscillation.

In this instance, it would be invalid to use a pair of values (length and period) to calculate the value of gravitational acceleration. A more appropriate method of analysis would be to plot the length and period squared to obtain a linear relationship, then use the value of the gradient of the line of best fit to determine the value of `g`.
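The graphical method described above can be sketched in Python. Squaring both sides of the pendulum equation gives `T^2 = (4\pi^2 / g) l`, so the gradient of a plot of `T^2` against `l` equals `4\pi^2 / g`. The readings below and the helper names `fit_line` and `g_from_pendulum` are hypothetical, made up purely for illustration; they are not data from this experiment.

```python
import math

def fit_line(xs, ys):
    """Least-squares gradient and intercept of the line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept

def g_from_pendulum(lengths_m, periods_s):
    """Estimate g from the gradient of T^2 plotted against l."""
    period_squared = [t ** 2 for t in periods_s]
    gradient, _ = fit_line(lengths_m, period_squared)
    # T^2 = (4*pi^2 / g) * l, so g = 4*pi^2 / gradient
    return 4 * math.pi ** 2 / gradient

# Hypothetical readings: string lengths in metres and the period of ONE
# oscillation in seconds (time for 10 oscillations divided by 10).
lengths = [1.0, 1.2, 1.5, 1.8]
periods = [2.01, 2.20, 2.46, 2.69]

print(round(g_from_pendulum(lengths, periods), 2))
```

Using the gradient of all four points, rather than a single length and period pair, means one poor reading has less influence on the final value of `g`.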

Accuracy refers to how close the experimental measurements are to the true value.

Accuracy depends on

the validity of the experiment
the degree of error:
systematic errors are those that are systemic in your experiment. That is, they affect every single one of your data points consistently, meaning that the cause of the error is always present. For example, it could be a badly calibrated temperature gauge that reports every reading 5 °C above the true value.
random errors are errors that occur inconsistently. For example, the temperature gauge readings might be affected by random fluctuations in room temperature. Some readings might be above the true value, some might then be below the true value.
sensitivity of equipment used.

Assessing Accuracy

The effect of errors and insensitive equipment can both be captured by calculating the percentage error:

$$\text{\% error} = \frac{|\text{experimental value} - \text{true value}|}{\text{true value}} \times 100\%$$

Generally, measurements are considered accurate when the percentage error is less than 5%. You should always take the context of the experiment into account when assessing accuracy.
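As a minimal sketch, the percentage error formula and the 5% rule of thumb can be written as two small Python functions. The function names and the example values (an experimental `g` of 9.6 against an accepted 9.8) are assumptions chosen for illustration only.

```python
def percentage_error(experimental, true_value):
    """Percentage error: |experimental - true| / |true| * 100."""
    if true_value == 0:
        raise ValueError("percentage error is undefined for a true value of 0")
    return abs(experimental - true_value) / abs(true_value) * 100

def is_accurate(experimental, true_value, threshold_percent=5.0):
    """Apply the rule of thumb: accurate if the error is under the threshold."""
    return percentage_error(experimental, true_value) < threshold_percent

# Hypothetical example: measured g of 9.6 m/s^2 against an accepted 9.8 m/s^2.
print(round(percentage_error(9.6, 9.8), 2))
print(is_accurate(9.6, 9.8))
```

The threshold is a parameter rather than a fixed constant because, as noted above, the context of the experiment decides what counts as accurate.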

While accuracy and validity have different definitions, the two are closely related. Accurate results often suggest that the underlying experiment is valid, as invalid experiments are unlikely to produce accurate results.

In a simple pendulum experiment, if your measurements of the pendulum's period are close to the calculated value, your experiment is accurate. A table showing sample experimental measurements vs accepted values from using the equation above is shown below.

All experimental values in the table above are within 5% of the accepted (theoretical) values, so they are considered accurate.

How to Improve Accuracy

Remove systematic errors: for example, if the experiment’s measuring instruments are poorly calibrated, then you should calibrate them correctly before doing the experiment again.
Reduce the influence of random errors: this can be done by having more repetitions in the experiment and reporting the average values. This is because if you have enough of these random errors – some above the true value and some below the true value – then averaging them will make them cancel each other out. This brings your average value closer and closer to the true value.
Use More Sensitive Equipment: For example, use a recording to measure time by analysing the motion of an object frame by frame, instead of using a stopwatch. The sensitivity of a piece of equipment can be measured by its limit of reading. For example, stopwatches may only measure to the nearest millisecond – that is their limit of reading. But recordings can be analysed to the frame. And, depending on the frame rate of the camera, this could mean measuring to the nearest microsecond.
Obtain More Measurements and Over a Wider Range: In some cases, the relationship between two variables can be more accurately determined by testing over a wider range. For example, in the pendulum experiment, periods when strings of various lengths are used can be measured. In this instance, repeating the experiment does not relate to reliability because we have changed the value of the independent variable tested.
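The difference between the first two points in this list can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true temperature of 25 °C, the random spread of 0.5 °C, and the function name `simulate_readings` are all invented for the example, and the 5 °C offset mirrors the badly calibrated gauge mentioned earlier.

```python
import random
import statistics

def simulate_readings(true_value, n, random_sd, systematic_offset=0.0, seed=0):
    """Generate n readings with Gaussian random error plus a fixed offset."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [true_value + systematic_offset + rng.gauss(0, random_sd)
            for _ in range(n)]

true_temp = 25.0  # hypothetical true temperature in degrees Celsius

few = simulate_readings(true_temp, 3, random_sd=0.5)
many = simulate_readings(true_temp, 10_000, random_sd=0.5)
biased = simulate_readings(true_temp, 10_000, random_sd=0.5,
                           systematic_offset=5.0)

# Averaging many readings shrinks the effect of random error...
print("error of mean, 3 readings:     ", statistics.mean(few) - true_temp)
print("error of mean, 10000 readings: ", statistics.mean(many) - true_temp)
# ...but a systematic offset survives any amount of averaging.
print("error of mean, biased readings:", statistics.mean(biased) - true_temp)
```

In this sketch, more repetitions bring the average closer to the true value only when the errors are random; the miscalibrated readings stay about 5 °C too high until the instrument itself is fixed.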

Reliability

Reliability involves the consistency of your results over multiple trials.

Assessing Reliability

The reliability of an experiment can be broken down into the reliability of the procedure and the reliability of the final results.

The reliability of the procedure refers to how consistently the steps of your experiment produce similar results. For example, if an experiment produces the same values every time it is repeated, then it is highly reliable. This can be assessed quantitatively by looking at the spread of measurements, using statistical tests such as greatest deviation from the mean, standard deviations, or z-scores.
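A minimal sketch of the spread measures named above, using Python's standard `statistics` module. The three mass-loss values are hypothetical and the function name `spread_summary` is an assumption for illustration.

```python
import statistics

def spread_summary(measurements):
    """Summarise the spread of repeated measurements of the same quantity."""
    mean = statistics.mean(measurements)
    sd = statistics.stdev(measurements)  # sample standard deviation (n - 1)
    greatest_deviation = max(abs(x - mean) for x in measurements)
    # z-score: how many standard deviations each reading sits from the mean
    if sd == 0:
        z_scores = [0.0 for _ in measurements]
    else:
        z_scores = [(x - mean) / sd for x in measurements]
    return {
        "mean": mean,
        "standard_deviation": sd,
        "greatest_deviation": greatest_deviation,
        "z_scores": z_scores,
    }

# Hypothetical mass losses (in grams) from three repeated trials.
summary = spread_summary([2.1, 2.2, 2.0])
print(summary)
```

A small standard deviation and a small greatest deviation relative to the mean suggest the procedure gives consistent results; what counts as "small" depends on the experiment.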

Ask yourself: "Is my result reproducible?"

The reliability of results cannot be assessed if there is only one data point or measurement obtained in the experiment. There must be at least 3. When you're repeating the experiment to assess the reliability of its results, you must follow the same steps and use the same value for the independent variable. Results obtained from methods with different steps cannot be assessed for their reliability.

Obtaining only one measurement in an experiment is not enough because it could be affected by errors or have been produced by pure chance. Repeating the experiment and obtaining the same or similar results will increase your confidence that the results are reproducible (and therefore reliable).

In the soft drink experiment, reliability can be assessed by repeating the steps at least three times:

The mass losses measured in all three trials are fairly consistent, suggesting that the reliability of the underlying method is high.

The reliability of the final results refers to how consistently your final data points (e.g. average value of repeated trials) point towards the same trend. That is, how close are they all to the trend line? This can be assessed quantitatively using the `R^2` value. The `R^2` value ranges between 0 and 1: a value of 0 suggests there is no correlation between data points, and a value of 1 suggests a perfect correlation with no variance from the trend line.

In the pendulum experiment, we can calculate the `R^2` value (done in Excel) by using the final average period values measured for each pendulum length.

Here, an `R^2` value of 0.9758 suggests the four average values are fairly close to the overall linear trend line (low variance from trend line). Thus, the results are fairly reliable.
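For readers who want to see what a spreadsheet does behind the scenes, here is a rough Python sketch of `R^2` for a straight line of best fit, computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample points are hypothetical and are not the data behind the 0.9758 figure.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares straight line through the points (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    # Residual sum of squares: scatter of the points around the trend line
    ss_res = sum((y - (gradient * x + intercept)) ** 2
                 for x, y in zip(xs, ys))
    # Total sum of squares: scatter of the points around their mean
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical points that lie close to, but not exactly on, a line.
print(r_squared([1, 2, 3, 4], [1, 3, 2, 4]))
```

For the pendulum experiment, `xs` would be the string lengths and `ys` the squared average periods, since those are the quantities expected to be linearly related.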

How to Improve Reliability

A common misconception is that increasing the number of trials increases the reliability of the procedure . This is not true. The only way to increase the reliability of the procedure is to revise it. This could mean using instruments that are less susceptible to random errors, which cause measurements to be more variable.

Increasing the number of trials actually increases the reliability of the final results . This is because having more repetitions reduces the influence of random errors and brings the average values closer to the true values. Generally, the closer experimental values are to true values, the closer they are to the true trend. That is, accurate data points are generally reliable and all point towards the same trend.

Reliable but Inaccurate / Invalid

It is important to understand that results from an experiment can be reliable (consistent), but inaccurate (deviate greatly from theoretical values) and/or invalid. In this case, your procedure is reliable, but your final results likely are not.

Examples of Reliability

Using the soft drink example again, if the mass losses measured for three soft drinks (same brand and type of drink) are consistent, then it's reliable.

Using the pendulum example again, if you get similar period measurements every time you repeat the experiment, it’s reliable.

However, in both cases, if the underlying methods are invalid, the consistent results would be invalid and inaccurate (despite being reliable).


Roberta Heale 1, Alison Twycross 2
1 School of Nursing, Laurentian University, Sudbury, Ontario, Canada
2 Faculty of Health and Social Care, London South Bank University, London, UK
Correspondence to: Dr Roberta Heale, School of Nursing, Laurentian University, Ramsey Lake Road, Sudbury, Ontario, Canada P3E2C6; rheale{at}laurentian.ca

https://doi.org/10.1136/eb-2015-102129


Evidence-based practice includes, in part, implementation of the findings of well-conducted quality research studies. So being able to critique quantitative research is an important skill for nurses. Consideration must be given not only to the results of the study but also the rigour of the research. Rigour refers to the extent to which the researchers worked to enhance the quality of the studies. In quantitative research, this is achieved through measurement of the validity and reliability. 1


Types of validity

The first category is content validity . This category looks at whether the instrument adequately covers all the content that it should with respect to the variable. In other words, does the instrument cover the entire domain related to the variable, or construct it was designed to measure? In an undergraduate nursing course with instruction about public health, an examination with content validity would cover all the content in the course with greater emphasis on the topics that had received greater coverage or more depth. A subset of content validity is face validity , where experts are asked their opinion about whether an instrument measures the concept intended.

Construct validity refers to whether you can draw inferences about test scores related to the concept being studied. For example, if a person has a high score on a survey that measures anxiety, does this person truly have a high degree of anxiety? In another example, a test of knowledge of medications that requires dosage calculations may instead be testing maths knowledge.

There are three types of evidence that can be used to demonstrate a research instrument has construct validity:

Homogeneity—meaning that the instrument measures one construct.

Convergence—this occurs when the instrument measures concepts similar to that of other instruments. If no similar instruments are available, however, convergence cannot be assessed.

Theory evidence—this is evident when behaviour is similar to theoretical propositions of the construct measured in the instrument. For example, when an instrument measures anxiety, one would expect to see that participants who score high on the instrument for anxiety also demonstrate symptoms of anxiety in their day-to-day lives. 2

The final measure of validity is criterion validity . A criterion is any other instrument that measures the same variable. Correlations can be conducted to determine the extent to which the different instruments measure the same variable. Criterion validity is measured in three ways:

Convergent validity—shows that an instrument is highly correlated with instruments measuring similar variables.

Divergent validity—shows that an instrument is poorly correlated to instruments that measure different variables. In this case, for example, there should be a low correlation between an instrument that measures motivation and one that measures self-efficacy.

Predictive validity—means that the instrument should have high correlations with future criteria. 2 For example, a score of high self-efficacy related to performing a task should predict the likelihood of a participant completing the task.

Reliability

Reliability relates to the consistency of a measure. A participant completing an instrument meant to measure motivation should have approximately the same responses each time the test is completed. Although it is not possible to give an exact calculation of reliability, an estimate of reliability can be achieved through different measures. The three attributes of reliability are outlined in table 2 . How each attribute is tested for is described below.

Attributes of reliability

Homogeneity (internal consistency) is assessed using item-to-total correlation, split-half reliability, Kuder-Richardson coefficient and Cronbach's α. In split-half reliability, the results of a test, or instrument, are divided in half. Correlations are calculated comparing both halves. Strong correlations indicate high reliability, while weak correlations indicate the instrument may not be reliable. The Kuder-Richardson test is a more complicated version of the split-half test. In this process the average of all possible split half combinations is determined and a correlation between 0–1 is generated. This test is more accurate than the split-half test, but can only be completed on questions with two answers (eg, yes or no, 0 or 1). 3

Cronbach's α is the most commonly used test to determine the internal consistency of an instrument. In this test, the average of all correlations in every combination of split-halves is determined. Instruments with questions that have more than two responses can be used in this test. The Cronbach's α result is a number between 0 and 1. An acceptable reliability score is one that is 0.7 and higher. 1 , 3
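As a hedged illustration, Cronbach's α is commonly computed with the variance-based formula `\alpha = \frac{k}{k-1}\left(1 - \frac{\sum \sigma_i^2}{\sigma_T^2}\right)`, where `k` is the number of items, `\sigma_i^2` is the variance of each item and `\sigma_T^2` is the variance of respondents' total scores. That formula is not stated in this article; it is brought in here as an assumption, and the survey responses and the function name `cronbach_alpha` are invented for the sketch.

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are needed")
    if any(len(row) != k for row in responses):
        raise ValueError("every respondent must answer every item")
    items = list(zip(*responses))  # one tuple of scores per item
    item_variances = [statistics.variance(scores) for scores in items]
    total_variance = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: 5 participants answering 3 Likert-style items.
responses = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 3],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3), "acceptable" if alpha >= 0.7 else "below 0.7")
```

The final line applies the 0.7 threshold quoted above; a real analysis would normally use far more than five participants.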

Stability is tested using test–retest and parallel or alternate-form reliability testing. Test–retest reliability is assessed when an instrument is given to the same participants more than once under similar circumstances. A statistical comparison is made between participants' test scores for each of the times they have completed it. This provides an indication of the reliability of the instrument. Parallel-form reliability (or alternate-form reliability) is similar to test–retest reliability except that a different form of the original instrument is given to participants in subsequent tests. The domain, or concepts being tested are the same in both versions of the instrument but the wording of items is different. 2 For an instrument to demonstrate stability there should be a high correlation between the scores each time a participant completes the test. Generally speaking, a correlation coefficient of less than 0.3 signifies a weak correlation, 0.3–0.5 is moderate and greater than 0.5 is strong. 4
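A test–retest comparison of this kind might be sketched as follows, assuming the Pearson correlation coefficient is the statistic used (the article does not name one at this point). The scores and the function names `pearson_r` and `correlation_strength` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length (at least 2 scores)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def correlation_strength(r):
    """Label a coefficient using the thresholds quoted above."""
    magnitude = abs(r)
    if magnitude < 0.3:
        return "weak"
    if magnitude <= 0.5:
        return "moderate"
    return "strong"

# Hypothetical scores for the same five participants at two sittings.
first_sitting = [12, 18, 15, 22, 9]
second_sitting = [13, 17, 16, 21, 10]
r = pearson_r(first_sitting, second_sitting)
print(round(r, 3), correlation_strength(r))
```

The same two functions could be reused for parallel-form reliability by passing the scores from the two versions of the instrument.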

Equivalence is assessed through inter-rater reliability. This test includes a process for qualitatively determining the level of agreement between two or more observers. A good example of the process used in assessing inter-rater reliability is the scores of judges for a skating competition. The level of consistency across all judges in the scores given to skating participants is the measure of inter-rater reliability. An example in research is when researchers are asked to give a score for the relevancy of each item on an instrument. Consistency in their scores relates to the level of inter-rater reliability of the instrument.

Determining how rigorously the issues of reliability and validity have been addressed in a study is an essential component in the critique of research as well as influencing the decision about whether to implement the study findings in nursing practice. In quantitative studies, rigour is determined through an evaluation of the validity and reliability of the tools or instruments utilised in the study. A good quality research study will provide evidence of how all these factors have been addressed. This will help you to assess the validity and reliability of the research and help you decide whether or not you should apply the findings in your area of clinical practice.

Lobiondo-Wood G ,
Shuttleworth M
Laerd Statistics. Determining the correlation coefficient. 2013. https://statistics.laerd.com/premium/pc/pearson-correlation-in-spss-8.php

Twitter Follow Roberta Heale at @robertaheale and Alison Twycross at @alitwy

Competing interests None declared.

Reliability and Validity

Issues of research reliability and validity need to be addressed concisely in the methodology chapter.

Reliability refers to the extent to which the same answers can be obtained using the same instruments more than one time. In simple terms, if your research is associated with high levels of reliability, then other researchers need to be able to generate the same results, using the same research methods under similar conditions. It is noted that “reliability problems crop up in many forms. Reliability is a concern every time a single observer is the source of data, because we have no certain guard against the impact of that observer’s subjectivity” (Babbie, 2010, p.158).

According to Wilson (2010), reliability issues are most often closely associated with subjectivity: once a researcher adopts a subjective approach towards the study, the level of reliability of the work is going to be compromised.

Validity of research can be explained as the extent to which the requirements of the scientific research method have been followed during the process of generating research findings. Oliver (2010) considers validity to be a compulsory requirement for all types of studies. There are different forms of research validity, and the main ones are specified by Cohen et al (2007) as content validity, criterion-related validity, construct validity, internal validity, external validity, concurrent validity and face validity.

Measures to ensure the validity of research include, but are not limited to, the following points:

a) Appropriate time scale for the study has to be selected;

b) Appropriate methodology has to be chosen, taking into account the characteristics of the study;

c) The most suitable sample method for the study has to be selected;

d) The respondents must not be pressured in any way to select specific choices among the answer sets.

It is important to understand that although threats to research reliability and validity can never be totally eliminated, researchers need to strive to minimize these threats as much as possible.

John Dudovskiy

Babbie, E. R. (2010) “The Practice of Social Research” Cengage Learning
Cohen, L., Manion, L., Morrison, K, & Morrison, R.B. (2007) “Research methods in education” Routledge
Oliver, V. (2010) “301 Smart Answers to Tough Business Etiquette Questions” Skyhorse Publishing
Wilson, J. (2010) “Essentials of Business Research: A Guide to Doing Your Research Project” SAGE Publications

Chapter 5: Psychological Measurement

Reliability and Validity of Measurement

Learning Objectives

Define reliability, including the different types and how they are assessed.
Define validity, including the different types and how they are assessed.
Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.

Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not simply assume that their measures work. Instead, they collect data to demonstrate that they work. If their research does not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.

Reliability

Reliability refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing Pearson’s r. Figure 5.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered two times, a week apart. Pearson’s r for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.

Score at time 1 is on the x-axis and score at time 2 is on the y-axis, showing fairly consistent scores

Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.

Internal Consistency

A second kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioural and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a split-half correlation. This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 5.3 shows the split-half correlation between several university students’ scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson’s r for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.
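The even/odd split described above can be sketched as follows. Each respondent's ten item scores are divided into odd- and even-numbered items, each half is summed, and the two sets of half-scores are correlated. The data are invented for illustration and are not the scores plotted in Figure 5.3.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

# Hypothetical responses: 6 people, 10 items each, scored 0-3.
responses = [
    [3, 3, 2, 3, 3, 2, 3, 3, 2, 3],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    [2, 2, 2, 1, 2, 2, 3, 2, 2, 2],
    [3, 2, 3, 3, 3, 3, 2, 3, 3, 3],
    [0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    [2, 1, 2, 2, 1, 2, 1, 2, 2, 1],
]

# Items 1, 3, 5, ... sit at list positions 0, 2, 4, ...
odd_scores = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]

split_half_r = pearson_r(odd_scores, even_scores)
print(f"split-half r = {split_half_r:.2f}, good = {split_half_r >= 0.80}")
```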

Score on even-numbered items is on the x-axis and score on odd-numbered items is on the y-axis, showing fairly consistent scores

Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called Cronbach’s α (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
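The figure of 252 can be checked by enumeration. The short sketch below counts the ways of choosing five of ten items, and also shows that each split is counted twice that way (once for each half being "chosen"), so there are 126 distinct pairs of halves. Averaging over 252 or over 126 gives the same mean, because every distinct split contributes exactly twice. Variable names are illustrative only.

```python
from itertools import combinations
from math import comb

items = tuple(range(1, 11))            # item numbers 1..10

# Every way of choosing which five items form the "first" half.
first_halves = list(combinations(items, 5))
n_ordered = len(first_halves)

# A split and its mirror image contain the same two halves,
# so store each pair of halves as an unordered set.
distinct_splits = {
    frozenset([half, tuple(i for i in items if i not in half)])
    for half in first_halves
}
n_distinct = len(distinct_splits)

print(n_ordered, comb(10, 5), n_distinct)
```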

Interrater Reliability

Many behavioural measures involve significant judgment on the part of an observer or a rater. Inter-rater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does in fact have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other. Inter-rater reliability would also have been measured in Bandura’s Bobo doll study. In this case, the observers’ ratings of how many acts of aggression a particular child committed while playing with the Bobo doll should have been highly positively correlated. Interrater reliability is often assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.
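For categorical judgments, Cohen's κ compares the agreement two raters actually reach with the agreement expected by chance from how often each rater uses each category: κ = (observed − expected) / (1 − expected). The sketch below is a minimal implementation for two raters; the function name and the ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    n = len(rater_a)
    # Proportion of cases on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's category frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical codes for ten video clips: aggressive ("agg") or not ("not").
rater_1 = ["agg", "agg", "not", "not", "agg", "not", "agg", "not", "not", "agg"]
rater_2 = ["agg", "agg", "not", "agg", "agg", "not", "agg", "not", "not", "not"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

Here the raters agree on 8 of 10 clips, but half of that agreement would be expected by chance, so κ is noticeably lower than the raw 80% agreement.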

Validity

Validity is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account: reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimetre longer than another’s would indicate nothing about which one had higher self-esteem.

Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure. Here we consider three basic kinds: face validity, content validity, and criterion validity.

Face Validity

Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behaviour, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of 567 different statements applies to them, where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.

Content Validity

Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Criterion Validity

Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).

Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. This is known as convergent validity.

Assessing convergent validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982) [1]. In a series of studies, they showed that people’s scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (which represents a tendency toward obedience). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009) [2].

Discriminant Validity

Discriminant validity, on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.

When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people’s scores were not correlated with certain other variables. For example, they found only a weak correlation between people’s need for cognition and a measure of their cognitive style—the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of “the big picture.” They also found no correlation between people’s need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.
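The pattern of evidence described in this section can be illustrated with a small sketch: a new measure should correlate negatively with a criterion it is expected to hurt (criterion validity), strongly and positively with an existing measure of the same construct (convergent validity), and close to zero with a conceptually distinct variable (discriminant validity). All variable names and numbers below are invented for illustration; they are not data from Cacioppo and Petty.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) *
                      sum((b - my) ** 2 for b in y))

# Hypothetical scores for eight students.
new_anxiety_scale = [12, 18, 9, 22, 15, 20, 7, 17]
existing_anxiety_scale = [14, 19, 10, 24, 15, 22, 9, 16]   # same construct
exam_score = [78, 65, 85, 55, 70, 60, 88, 68]              # criterion
unrelated_trait = [6, 6, 4, 4, 7, 3, 4, 6]                 # distinct construct

r_convergent = pearson_r(new_anxiety_scale, existing_anxiety_scale)
r_criterion = pearson_r(new_anxiety_scale, exam_score)
r_discriminant = pearson_r(new_anxiety_scale, unrelated_trait)

print(f"convergent:   {r_convergent:+.2f}")
print(f"criterion:    {r_criterion:+.2f}")
print(f"discriminant: {r_discriminant:+.2f}")
```

Real data would rarely be this clean; the point is only the expected direction and rough size of each correlation.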

Key Takeaways

Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
The reliability and validity of a measure is not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute Pearson’s r too if you know how.
Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability and criterion validity?
Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116–131.
Petty, R. E., Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behaviour (pp. 318–329). New York, NY: Guilford Press.

Reliability: The consistency of a measure.

Test-retest reliability: The consistency of a measure over time.

Test-retest correlation: The consistency of a measure on the same group of people at different times.

Internal consistency: Consistency of people’s responses across the items on a multiple-item measure.

Split-half correlation: Method of assessing internal consistency through splitting the items into two sets and examining the relationship between them.

Cronbach’s α: A statistic in which α is the mean of all possible split-half correlations for a set of items.

Inter-rater reliability: The extent to which different observers are consistent in their judgments.

Validity: The extent to which the scores from a measure represent the variable they are intended to.

Face validity: The extent to which a measurement method appears to measure the construct of interest.

Content validity: The extent to which a measure “covers” the construct of interest.

Criterion validity: The extent to which people’s scores on a measure are correlated with other variables that one would expect them to be correlated with.

Criteria: In reference to criterion validity, variables that one would expect to be correlated with the measure.

Concurrent validity: When the criterion is measured at the same time as the construct.

Predictive validity: When the criterion is measured at some point in the future (after the construct has been measured).

Convergent validity: When new measures positively correlate with existing measures of the same constructs.

Discriminant validity: The extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct.

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.

5.13: The Reliability and Validity of Research

Learning Objectives

Define reliability and validity

Interpreting Experimental Findings

Once data is collected from both the experimental and the control groups, a statistical analysis is conducted to find out if there are meaningful differences between the two groups. A statistical analysis estimates how likely it is that a difference at least as large as the one observed would arise by chance alone if there were no real effect. In psychology, group differences are considered meaningful, or significant, if that probability is 5 percent or less. Stated another way, if there were truly no difference between the groups and we repeated the experiment 100 times, we would expect chance alone to produce a difference this large no more than about 5 times.
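One simple way to make "chance alone" concrete is an exact permutation test, a technique the text does not name but which follows the same logic: if group membership did not matter, every reshuffling of the scores into two groups would be equally likely, so we can count how many reshufflings give a difference in means at least as large as the one observed. The sketch below assumes two tiny hypothetical groups; the function name and the data are illustrative.

```python
from itertools import combinations

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def mean_diff(sum_a):
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(sum(group_a)))
    # Every way of relabelling n_a of the pooled scores as "group A".
    splits = list(combinations(range(len(pooled)), n_a))
    extreme = sum(
        abs(mean_diff(sum(pooled[i] for i in idx))) >= observed - 1e-12
        for idx in splits
    )
    return extreme / len(splits)

# Hypothetical aggression scores after a violent vs. nonviolent programme.
experimental = [5, 6, 7, 8]
control = [1, 2, 3, 4]

p = permutation_p_value(experimental, control)
print(f"p = {p:.3f}, significant at 5% = {p <= 0.05}")
```

With only eight scores there are 70 possible relabellings, and just two of them (the observed one and its mirror image) separate the groups this completely, so the difference clears the 5 percent threshold.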

The greatest strength of experiments is the ability to assert that any significant differences in the findings are caused by the independent variable. This occurs because random selection, random assignment, and a design that limits the effects of both experimenter bias and participant expectancy should create groups that are similar in composition and treatment. Therefore, any difference between the groups is attributable to the independent variable, and now we can finally make a causal statement. If we find that watching a violent television program results in more violent behavior than watching a nonviolent program, we can safely say that watching violent television programs causes an increase in the display of violent behavior.

Reporting Research

When psychologists complete a research project, they generally want to share their findings with other scientists. The American Psychological Association (APA) publishes a manual detailing how to write a paper for submission to scientific journals. Unlike an article that might be published in a magazine like Psychology Today, which targets a general audience with an interest in psychology, scientific journals generally publish peer-reviewed journal articles aimed at an audience of professionals and scholars who are actively involved in research themselves.

Link to Learning

The Online Writing Lab (OWL) at Purdue University can walk you through the APA writing guidelines.

A peer-reviewed journal article is read by several other scientists (generally anonymously) with expertise in the subject matter. These peer reviewers provide feedback—to both the author and the journal editor—regarding the quality of the draft. Peer reviewers look for a strong rationale for the research being described, a clear description of how the research was conducted, and evidence that the research was conducted in an ethical manner. They also look for flaws in the study’s design, methods, and statistical analyses. They check that the conclusions drawn by the authors seem reasonable given the observations made during the research. Peer reviewers also comment on how valuable the research is in advancing the discipline’s knowledge. This helps prevent unnecessary duplication of research findings in the scientific literature and, to some extent, ensures that each research article provides new information. Ultimately, the journal editor will compile all of the peer reviewer feedback and determine whether the article will be published in its current state (a rare occurrence), published with revisions, or not accepted for publication.

Peer review provides some degree of quality control for psychological research. Poorly conceived or executed studies can be weeded out, and even well-designed research can be improved by the revisions suggested. Peer review also ensures that the research is described clearly enough to allow other scientists to replicate it, meaning they can repeat the experiment using different samples to determine reliability. Sometimes replications involve additional measures that expand on the original finding. In any case, each replication serves to provide more evidence to support the original research findings. Successful replications of published research make scientists more apt to adopt those findings, while repeated failures tend to cast doubt on the legitimacy of the original article and lead scientists to look elsewhere. For example, it would be a major advancement in the medical field if a published study indicated that taking a new drug helped individuals achieve a healthy weight without changing their diet. But if other scientists could not replicate the results, the original study’s claims would be questioned.

Dig Deeper: The Vaccine-Autism Myth and the Retraction of Published Studies

Some scientists have claimed that routine childhood vaccines cause some children to develop autism, and, in fact, several peer-reviewed publications published research making these claims. Since the initial reports, large-scale epidemiological research has suggested that vaccinations are not responsible for causing autism and that it is much safer to have your child vaccinated than not. Furthermore, several of the original studies making this claim have since been retracted.

A published piece of work can be rescinded when data is called into question because of falsification, fabrication, or serious research design problems. Once rescinded, the scientific community is informed that there are serious problems with the original publication. Retractions can be initiated by the researcher who led the study, by research collaborators, by the institution that employed the researcher, or by the editorial board of the journal in which the article was originally published. In the vaccine-autism case, the retraction was made because of a significant conflict of interest in which the leading researcher had a financial interest in establishing a link between childhood vaccines and autism (Offit, 2008). Unfortunately, the initial studies received so much media attention that many parents around the world became hesitant to have their children vaccinated (Figure 1). For more information about how the vaccine/autism story unfolded, as well as the repercussions of this story, take a look at Paul Offit’s book, Autism’s False Prophets: Bad Science, Risky Medicine, and the Search for a Cure.

A photograph shows a child being given an oral vaccine.

Reliability and Validity

Reliability and validity are two important considerations that must be made with any type of data collection. Reliability refers to the ability to consistently produce a given result. In the context of psychological research, this would mean that any instruments or tools used to collect data do so in consistent, reproducible ways. Unfortunately, being consistent in measurement does not necessarily mean that you have measured something correctly. This is where validity comes into play. Validity refers to the extent to which a given instrument or tool accurately measures what it’s supposed to measure. While any valid measure is by necessity reliable, the reverse is not necessarily true. Researchers strive to use instruments that are both highly reliable and valid.

Everyday Connection: How Valid Is the SAT?

Standardized tests like the SAT are supposed to measure an individual’s aptitude for a college education, but how reliable and valid are such tests? Research conducted by the College Board suggests that scores on the SAT have high predictive validity for first-year college students’ GPA (Kobrin, Patterson, Shaw, Mattern, & Barbuti, 2008). In this context, predictive validity refers to the test’s ability to effectively predict the GPA of college freshmen. Given that many institutions of higher education require the SAT for admission, this high degree of predictive validity might be comforting.

However, the emphasis placed on SAT scores in college admissions has generated some controversy on a number of fronts. For one, some researchers assert that the SAT is a biased test that places minority students at a disadvantage and unfairly reduces the likelihood of being admitted into a college (Santelices & Wilson, 2010). Additionally, some research has suggested that the predictive validity of the SAT is grossly exaggerated in how well it is able to predict the GPA of first-year college students. In fact, it has been suggested that the SAT’s predictive validity may be overestimated by as much as 150% (Rothstein, 2004). Many institutions of higher education are beginning to consider de-emphasizing the significance of SAT scores in making admission decisions (Rimer, 2008).

Recent high-profile cheating scandals, both domestically and abroad, have only increased the scrutiny placed on these types of tests. As of March 2019, more than 1,000 institutions of higher education had either relaxed or eliminated their SAT or ACT testing requirements for admission (Strauss, 2019, March 19).


reliability: consistency and reproducibility of a given result

Licenses and Attributions

CC licensed content, Shared previously

Analyzing Findings. Authored by: OpenStax College. Located at: http://cnx.org/contents/[email protected]:mfArybye@7/Analyzing-Findings. License: CC BY: Attribution. License Terms: Download for free at http://cnx.org/contents/[email protected]


Validity and Reliability

The principles of validity and reliability are fundamental cornerstones of the scientific method.


Together, they are at the core of what is accepted as scientific proof, by scientist and philosopher alike.

An experimental design that follows a few basic principles is far more likely to stand up to rigorous questioning and skepticism.

What is Reliability?

The idea behind reliability is that any significant results must be more than a one-off finding and be inherently repeatable.

Other researchers must be able to perform exactly the same experiment, under the same conditions, and generate the same results. This will reinforce the findings and ensure that the wider scientific community will accept the hypothesis.

Without this replication of statistically significant results, the experiment and research have not fulfilled all of the requirements of testability.

This prerequisite is essential to a hypothesis establishing itself as an accepted scientific truth.

For example, if you are performing a time-critical experiment, you will be using some type of stopwatch. Generally, it is reasonable to assume that the instrument is reliable and will keep true and accurate time. However, diligent scientists repeat their measurements many times to minimize the chance that a malfunction goes unnoticed and to maintain validity and reliability.

At the other extreme, any experiment that uses human judgment is always going to come under question.

For example, if observers rate certain aspects, like in Bandura’s Bobo Doll Experiment, then the reliability of the test is compromised. Human judgment can vary wildly between observers, and the same individual may rate things differently depending upon time of day and current mood.

This means that such experiments are more difficult to repeat and are inherently less reliable. Reliability is a necessary ingredient for determining the overall validity of a scientific experiment and enhancing the strength of the results.

Debate between social and pure scientists, concerning reliability, is robust and ongoing.

What is Validity?

Validity encompasses the entire experimental concept and establishes whether the results obtained meet all of the requirements of the scientific research method.

For example, there must have been randomization of the sample groups and appropriate care and diligence shown in the allocation of controls.

Internal validity dictates how an experimental design is structured and encompasses all of the steps of the scientific research method.

Even if your results are great, sloppy and inconsistent design will compromise your integrity in the eyes of the scientific community. Internal validity and reliability are at the core of any experimental design.

External validity is the process of examining the results and questioning whether there are any other possible causal relationships.

Control groups and randomization will lessen external validity problems, but no method can be completely successful. This is why the statistical proofs of a hypothesis are called significant, not absolute truth.

Any scientific research design only puts forward a possible cause for the studied effect.

There is always the chance that another unknown factor contributed to the results and findings. This extraneous causal relationship may become more apparent, as techniques are refined and honed.

If you have constructed your experiment to contain validity and reliability then the scientific community is more likely to accept your findings.

Eliminating other potential causal relationships, by using controls and duplicate samples, is the best way to ensure that your results stand up to rigorous questioning.


Martyn Shuttleworth (Oct 20, 2008). Validity and Reliability. Retrieved Apr 25, 2024 from Explorable.com: https://explorable.com/validity-and-reliability

You Are Allowed To Copy The Text

The text in this article is licensed under the Creative Commons-License Attribution 4.0 International (CC BY 4.0) .

This means you're free to copy, share and adapt any parts (or all) of the text in the article, as long as you give appropriate credit and provide a link/reference to this page.

That is it. You don't need our permission to copy the article; just include a link/reference back to this page. You can use it freely (with some kind of link), and we're also okay with people reprinting in publications like books, blogs, newsletters, course-material, papers, wikipedia and presentations (with clear attribution).


J Grad Med Educ
v.3(2); 2011 Jun

A Primer on the Validity of Assessment Instruments

1. What is reliability? 1

Reliability refers to whether an assessment instrument gives the same results each time it is used in the same setting with the same type of subjects. Reliability essentially means consistent or dependable results. Reliability is a part of the assessment of validity.

2. What is validity? 1

Validity in research refers to how accurately a study answers the study question or the strength of the study conclusions. For outcome measures such as surveys or tests, validity refers to the accuracy of measurement. Here validity refers to how well the assessment tool actually measures the underlying outcome of interest. Validity is not a property of the tool itself, but rather of the interpretation or specific purpose of the assessment tool with particular settings and learners.

Assessment instruments must be both reliable and valid for study results to be credible. Thus, reliability and validity must be examined and reported, or references cited, for each assessment instrument used to measure study outcomes. Examples of assessments include resident feedback survey, course evaluation, written test, clinical simulation observer ratings, needs assessment survey, and teacher evaluation. Using an instrument with high reliability is not sufficient; other measures of validity are needed to establish the credibility of your study.

3. How is reliability measured? 2 – 4

Reliability can be estimated in several ways; the method will depend upon the type of assessment instrument. Sometimes reliability is referred to as internal validity or internal structure of the assessment tool.

For internal consistency, 2 to 3 questions or items are created that measure the same concept, and the agreement among the answers is assessed. That is, the correlation among the answers is measured.

Cronbach alpha is a test of internal consistency and frequently used to calculate the correlation values among the answers on your assessment tool. 5 Cronbach alpha calculates correlation among all the variables, in every combination; a high reliability estimate should be as close to 1 as possible.
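As a rough illustration of how Cronbach alpha can be computed from a single administration, the sketch below uses the commonly cited variance-based formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The editorial does not spell out a formula, so treat this choice as an assumption. The function name `cronbach_alpha` and the survey scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Estimate Cronbach alpha.

    item_scores: one list per item, each holding the respondents' scores
    in the same respondent order.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("need at least two items")
    # Sample variance of each item across respondents.
    item_variances = [variance(item) for item in item_scores]
    # Total score per respondent (sum across items), then its variance.
    totals = [sum(scores) for scores in zip(*item_scores)]
    total_variance = variance(totals)
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical answers from 5 respondents to 3 items meant to tap the same concept.
items = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 4],
    [4, 2, 5, 1, 3],
]
alpha = cronbach_alpha(items)
print(f"Cronbach alpha: {alpha:.2f}")
```

Values closer to 1 suggest that the items move together. With so few respondents the estimate is only illustrative.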

For test/retest, the test should give the same results each time, assuming there are no interval changes in what you are measuring. Test/retest reliability is often reported as a correlation, using Pearson r.

Test/retest is a more conservative estimate of reliability than Cronbach alpha, but it takes at least 2 administrations of the tool, whereas Cronbach alpha can be calculated after a single administration. To perform a test/retest, you must be able to minimize or eliminate any change (ie, learning) in the condition you are measuring, between the 2 measurement times. Administer the assessment instrument at 2 separate times for each subject and calculate the correlation between the 2 different measurements.
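The test/retest procedure described above could be sketched as follows. The scores for the two administrations are invented, and `pearson_r` is a hand-written helper rather than a library call, included only to show the arithmetic behind the correlation.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (dev_x * dev_y)


# Hypothetical scores for the same 5 subjects at two separate times.
first_administration = [12, 15, 9, 20, 17]
second_administration = [13, 14, 10, 19, 18]

r = pearson_r(first_administration, second_administration)
print(f"Test/retest Pearson r: {r:.2f}")
```

A high r here only supports reliability if nothing changed in the subjects between the two measurement times, as the text notes.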

Interrater reliability is used to study the effect of different raters or observers using the same tool and is generally estimated by percent agreement, kappa (for binary outcomes), or Kendall tau.
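For two raters and a pass/fail outcome, percent agreement and kappa might be computed as in the sketch below. It assumes Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The editorial says only "kappa", so that choice, the function names, and the ratings are all illustrative assumptions.

```python
from collections import Counter


def percent_agreement(ratings_1, ratings_2):
    """Share of cases on which the two raters gave the same rating."""
    matches = sum(a == b for a, b in zip(ratings_1, ratings_2))
    return matches / len(ratings_1)


def cohen_kappa(ratings_1, ratings_2):
    """Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(ratings_1)
    observed = percent_agreement(ratings_1, ratings_2)
    counts_1, counts_2 = Counter(ratings_1), Counter(ratings_2)
    categories = set(ratings_1) | set(ratings_2)
    # Chance agreement from each rater's marginal proportions.
    expected = sum((counts_1[c] / n) * (counts_2[c] / n) for c in categories)
    # Undefined (division by zero) if both raters always use one identical category.
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail ratings of 10 residents by two observers.
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

agreement = percent_agreement(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(f"Percent agreement: {agreement:.0%}, kappa: {kappa:.2f}")
```

Kappa comes out lower than raw percent agreement because some of the matches would be expected by chance alone.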

Another method uses analysis of variance (ANOVA) to generate a generalizability coefficient, to quantify how much measurement error can be attributed to each potential factor, such as different test items, subjects, raters, dates of administration, and so forth. This model looks at the overall reliability of the results. 6
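One simple case of this approach is a design in which every subject is scored by every rater. The sketch below assumes that design, with one score per cell, and computes a relative generalizability coefficient from the ANOVA mean squares. The design, the function name `g_coefficient`, and the ratings are assumptions made for illustration, and real generalizability studies typically involve more factors.

```python
def g_coefficient(scores):
    """Relative G coefficient for a subjects x raters design.

    scores[p][r] is the rating given to subject p by rater r
    (every rater rates every subject once).
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    subject_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(col) / n_p for col in zip(*scores)]

    # Sums of squares for subjects, raters, and the residual.
    ss_subjects = n_r * sum((m - grand) ** 2 for m in subject_means)
    ss_raters = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n_p - 1)
    ms_residual = ss_residual / ((n_p - 1) * (n_r - 1))

    # Variance components; a negative estimate is truncated to zero.
    var_subjects = max((ms_subjects - ms_residual) / n_r, 0.0)
    var_error = ms_residual / n_r  # error variance of the mean over n_r raters
    if var_subjects + var_error == 0:
        raise ValueError("no variance in scores; coefficient is undefined")
    return var_subjects / (var_subjects + var_error)


# Hypothetical ratings: 4 subjects (rows) each scored by 3 raters (columns).
ratings = [
    [7, 8, 7],
    [5, 5, 6],
    [9, 9, 8],
    [4, 5, 4],
]
g = g_coefficient(ratings)
print(f"Generalizability coefficient: {g:.2f}")
```

In this sketch, a value near 1 means that most of the variation in average scores reflects real differences between subjects rather than rater inconsistency.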

5. How is the validity of an assessment instrument determined? 4 – 7 , 8

Validity of assessment instruments requires several sources of evidence to build the case that the instrument measures what it is supposed to measure. 9,10 Determining validity can be viewed as constructing an evidence-based argument regarding how well a tool measures what it is supposed to measure. Evidence can be assembled to support, or not support, a specific use of the assessment tool. Evidence can be found in content, response process, relationships to other variables, and consequences.

Content includes a description of the steps used to develop the instrument. Provide information such as who created the instrument (national experts would confer greater validity than local experts, who in turn would have more validity than nonexperts) and other steps that support the claim that the instrument has the appropriate content.

Response process includes information about whether the actions or thoughts of the subjects actually match the test and also information regarding training for the raters/observers, instructions for the test-takers, instructions for scoring, and clarity of these materials.

Relationship to other variables includes correlation of the new assessment instrument results with other performance outcomes that would likely be the same. If there is a previously accepted “gold standard” of measurement, correlate the instrument results to the subject's performance on the “gold standard.” In many cases, no “gold standard” exists and comparison is made to other assessments that appear reasonable (eg, in-training examinations, objective structured clinical examinations, rotation “grades,” similar surveys).

Consequences means that if there are pass/fail or cut-off performance scores, those grouped in each category tend to perform the same in other settings. Also, if lower performers receive additional training and their scores improve, this would add to the validity of the instrument.

Different types of instruments need an emphasis on different sources of validity evidence. 7 For example, for observer ratings of resident performance, interrater agreement may be key, whereas for a survey measuring resident stress, relationship to other variables may be more important. For a multiple choice examination, content and consequences may be essential sources of validity evidence. For high-stakes assessments (eg, board examinations), substantial evidence to support the case for validity will be required. 9

There are also other types of validity evidence, which are not discussed here.

6. How can researchers enhance the validity of their assessment instruments?

First, do a literature search and use previously developed outcome measures. If the instrument must be modified for use with your subjects or setting, modify and describe how, in a transparent way. Include sufficient detail to allow readers to understand the potential limitations of this approach.

If no assessment instruments are available, use content experts to create your own and pilot the instrument prior to using it in your study. Test reliability and include as many sources of validity evidence as are possible in your paper. Discuss the limitations of this approach openly.

7. What are the expectations of JGME editors regarding assessment instruments used in graduate medical education research?

JGME editors expect that discussions of the validity of your assessment tools will be explicitly mentioned in your manuscript, in the methods section. If you are using a previously studied tool in the same setting, with the same subjects, and for the same purpose, citing the reference(s) is sufficient. Additional discussion about your adaptation is needed if you (1) have modified previously studied instruments; (2) are using the instrument for different settings, subjects, or purposes; or (3) are using different interpretation or cut-off points. Discuss whether the changes are likely to affect the reliability or validity of the instrument.

Researchers who create novel assessment instruments need to state the development process, reliability measures, pilot results, and any other information that may lend credibility to the use of homegrown instruments. Transparency enhances credibility.

In general, little information can be gleaned from single-site studies using untested assessment instruments; these studies are unlikely to be accepted for publication.

8. What are useful resources for reliability and validity of assessment instruments?

The references for this editorial are a good starting point.

Gail M. Sullivan, MD, MPH, is Editor-in-Chief, Journal of Graduate Medical Education.

