
Measurement in Nursing Research

Curtis, Alexa Colgrove PhD, MPH, FNP, PMHNP; Keeler, Courtney PhD

Alexa Colgrove Curtis is assistant dean and professor of graduate nursing and director of the MPH–DNP dual degree program and Courtney Keeler is an associate professor, both at the University of San Francisco School of Nursing and Health Professions. Contact author: Alexa Colgrove Curtis, [email protected] . Nursing Research, Step by Step is coordinated by Bernadette Capili, PhD, NP-C: [email protected] . The authors have disclosed no potential conflicts of interest, financial or otherwise. A podcast with the authors is available at www.ajnonline.com .


Editor's note: This is the fourth article in a series on clinical research by nurses. The series is designed to give nurses the knowledge and skills they need to participate in research, step by step. Each column will present the concepts that underpin evidence-based practice—from research design to data interpretation. The articles will be accompanied by a podcast offering more insight and context from the authors. To see all the articles in the series, go to https://links.lww.com/AJN/A204 .

Quantitative research examines associations between research variables as measured through numerical analysis, where study effects (outcomes) are analyzed using statistical techniques. Such techniques include descriptive statistics (for example, sample mean and standard deviation) and inferential statistics, which uses the laws of probability to evaluate for statistically significant differences between sample groups (for example, t test, ANOVA, and regression analysis). Qualitative research explores research questions through an analysis of nonnumerical data sources (for example, text sources collected directly or indirectly by the researcher) and reports outcomes as themes or concepts that describe a phenomenon or experience.

As described in the first installment of this series, “a common goal of clinical research is to understand health and illness and to discover novel methods to detect, diagnose, treat, and prevent disease”; with this in mind, research questions must “focus on clear approaches to measuring or quantifying change or outcome,” the research outcome being the “planned measure to determine the effect of an intervention on the population under study.” 1

In this article, we explore measurement in quantitative research. We will also consider the concepts of validity and reliability as they relate to quantitative research measurement. Qualitative analysis will be considered separately in a future article in this series, as this methodology does not typically use a prescribed mechanism for measurement of research variables.

DEFINING THE VARIABLE OF INTEREST

Measurement in research begins with defining the variables of interest. Often, researchers are interested in exploring how variation in one factor or phenomenon influences variation in another. The dependent variables (outcome variables) in a study reflect the primary phenomenon of interest and the independent variables (or explanatory variables) reflect the factors that are hypothesized to have an impact on the primary phenomenon of interest (the dependent variable). 2 For example, a researcher might reasonably hypothesize that body mass index (BMI) influences blood pressure, further hypothesizing that increases in BMI are associated with increases in blood pressure. In a study testing this hypothesis, blood pressure is the dependent variable and BMI is an independent variable.

In identifying the variables of interest in a study, researchers are likely to have ideas of concepts they would like to explore. In the above example, one of the concepts of interest to the researcher is weight. A conceptual definition of a research variable provides a general theoretical understanding of that variable; regarding weight, a person might be considered “thin” or “overweight.” In moving from theory to practice, however, the researcher must consider how to operationalize this theoretical definition—that is, the researcher needs to select specific mechanisms for measuring the variables conceptualized in the study. Thus, an operational definition provides a measurable definition of a variable. Continuing with the above example, BMI would be a means of operationalizing the weight variable, where a person with a BMI of 25 or above is categorized by the Centers for Disease Control and Prevention as overweight. 3 In operationalizing variables, researchers should first look to existing evidence-based literature, practices, and professional guidelines. For instance, the researcher might consider measuring depression using the validated and widely utilized Patient Health Questionnaire-9 (PHQ-9) depression assessment scale or assessing longitudinal hyperglycemic risk by using the accepted measurement of glycated hemoglobin (HbA 1c ) level.

MEASUREMENT TOOLS

Researchers rely on measurement tools and instruments to create quantitative assessments of the variables studied. In some cases, direct measurements can be made using biometric measurement instruments to collect physiologic data such as weight, blood pressure, oxygen saturation level, and serum laboratory values. These biometric assessments are considered direct measures . 4 To quantify more abstract concepts, such as mood states, attitudes, and theoretical concepts like “caring,” researchers must consider less obvious proxy measures. Proxy measures constructed to quantify more abstract concepts are considered indirect measures . 4 For instance, Hughes developed an instrument to assess peer group caring during informal peer interactions among undergraduate nursing students. 5 While unable to directly measure the theoretical concept of caring, Hughes was able to construct an indirect proxy assessment using a survey tool.

Indirect, and even direct, measures can be operationalized in several ways. For instance, a researcher may consider operationalizing the concept of depression using the PHQ-9 depression assessment scale, the Center for Epidemiological Studies Depression scale, or the diagnostic criteria of the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition . The study findings may be affected by how a variable is operationalized and which measurement tools are utilized; therefore, researchers should give serious thought to study objectives, sample/target populations, and other important considerations when operationalizing a variable. More specifics on measurement formats and methods of administration will be explored in the next installment of this series.

LEVELS OF MEASUREMENT

Levels of measurement describe the structure of a variable (see Table 1 ).

Table 1. Levels of Measurement

  • Nominal: Data are grouped into distinct and exclusive categories that cannot be rank ordered (examples: gender identity, race/ethnicity, clinical diagnoses).
  • Ordinal: Data are categorized into distinct and exclusive groups that can be placed in rank order (examples: Likert-type scales, the 0-to-10 pain scale).
  • Interval: Data are ordered along a continuum with equal distances between data points but do not contain a true zero value, that is, zero does not represent the absence of the characteristic (example: temperature in degrees Fahrenheit).
  • Ratio: Data are measured continuously with equal spacing between intervals and include a true zero value (examples: height, weight, heart rate, serum laboratory values).

Nominal level . The lowest form of measurement, the nominal level groups data into distinct and exclusive categories that cannot be rank ordered. 2 Gender identity, race/ethnicity, occupation, geographic location, and clinical diagnoses are all examples of categories that contain nominal level data. This type of variable may also be referred to as a categorical variable. 6

Ordinal level data can also be categorized into distinct and exclusive groups; however, unlike nominal data, ordinal data can be ordered by rank. Likert-type scale variables reflect a classic example of ordinal level data, where responses can be rank ordered by “strongly disagree,” “disagree,” “neutral,” “agree,” and “strongly agree.” The 0-to-10-point pain scale is another example of the ordinal level of measurement. Using this scale, a patient provides a subjective determination of the experience of pain, where 0 reflects no pain and 10 reflects the worst pain imaginable. As with all ordinal data, the precise quantitative distance between the descriptor data points is impossible to assess—the differences between a pain determination of 3 and one of 4 and a pain determination of 7 and one of 8 cannot be precisely calculated. Further, the distance between each pain level (for example, jumping from a pain level of 3 to 4 or from a pain level of 7 to 8) is not assumed to be incrementally or objectively equal. 2 Despite these drawbacks, ordinal level data are frequently translated into a numerical expression so they can be analyzed as interval or ratio data. For example, a Likert scale can be translated into a scale ranging from “strongly disagree = 1” to “strongly agree = 5,” allowing for the calculation of a numerical mean score.
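As a minimal illustration of this translation (using invented responses), the short Python sketch below maps Likert responses to the numeric codes described above and computes a mean score.

```python
# Numeric codes for the Likert responses described above (1 = strongly disagree, 5 = strongly agree).
likert_codes = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

# Hypothetical responses to a single survey item (for illustration only).
responses = ["agree", "neutral", "strongly agree", "agree", "disagree"]

scores = [likert_codes[r] for r in responses]
mean_score = sum(scores) / len(scores)
print(f"Mean item score: {mean_score:.2f}")  # Mean item score: 3.60
```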

Interval level data reflect an ordered sequencing of data points with distances that are assumed to be quantifiable and equal in magnitude, such as ambient temperature. As with ambient temperature measured in degrees Fahrenheit, the magnitude of the difference between each data point is assumed to be equal along a continuum of continuous values. Of note, interval data do not include a true and meaningful zero, a value that would represent the total absence of the characteristic being measured. 2 For example, a temperature of zero degrees Fahrenheit does not represent the absence of temperature.

Ratio level data provide the final and most robust level of measurement. Ratio level data are measured continuously, with equal spacing between intervals and with a true zero. Examples include height, weight, heart rate, and serum laboratory values. A zero value is interpreted as the absence of the characteristic. Once again, researchers should be cautious in defining how a variable is operationalized because the level of measurement will influence the types of statistical analyses that can be performed in the evaluation of study outcomes. Interval and ratio levels of measurement result in the most robust statistical analyses and research results. Statistical analysis techniques will be discussed in more detail later in this series.

MEASUREMENT ERROR

For variables to provide a meaningful and appropriate representation of the underlying concept being measured, data measurement needs to be accurate and precise. Measurement error reflects the difference between the measured value and the true value of the underlying concept. The value of an individual measurement can be described as follows 7 :

individual measurement = exact value + bias + chance error

Chance error (or random error) changes from measurement to measurement, while bias (or systematic error) influences “all measurements the same way, pushing them in the same direction.” 7 Chance errors are individually unpredictable and inconsistent, and in the long run they should cancel each other out. If there is no bias in one's measurement, the average of repeated measurements should therefore ultimately equal the exact value. Bias, however, causes a systematic deviation from the true, underlying value and does not average out with repeated measurement.

Weight offers an excellent example of measurement error. Suppose some patients are weighed in the morning, some in the afternoon; some wear coats while others do not; some have eaten while others have fasted, and so forth. This variation reflects random error—we'd expect this positive and negative, over- and underestimation, to average out once enough patients have been sampled. Further, suppose the scale is incorrectly calibrated, such that it reports that every person weighs five pounds more than her or his actual weight. This result reflects a positive bias in the estimates and will not be corrected no matter how many patients are sampled.
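To see why chance error tends to average out while bias does not, consider the small simulation below; the true weight, the spread of the random error, and the five-pound miscalibration are invented values used only to illustrate the point.

```python
import random

random.seed(42)

true_weight = 150.0   # a patient's exact weight in pounds (hypothetical)
bias = 5.0            # a miscalibrated scale adds 5 lb to every reading
n_measurements = 10_000

# Each reading follows: individual measurement = exact value + bias + chance error.
readings = [
    true_weight + bias + random.gauss(0, 3)  # chance error: mean 0, SD 3 lb
    for _ in range(n_measurements)
]

average_reading = sum(readings) / n_measurements
print(f"Average of {n_measurements} readings: {average_reading:.1f} lb")
# The chance errors largely cancel out, but the 5 lb bias remains:
# the average settles near 155 lb rather than the true 150 lb.
```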

The potential for bias resulting from measurement error falls broadly under the category of information bias —are researchers measuring what they think they are measuring? Information bias is present if the study data collected are somehow incorrect. 8 This can occur because of faulty measurement practices that systematically result in the under- or overvaluation of a measure, as described in the scale example above, or because of systematic misreporting by respondents. There are many forms of information bias, including recall bias, interviewer bias, and misclassification, as well as systematic differences in soliciting, recording, and interpreting information. 8 For instance, consider a study of adolescent sexual behavior in which adolescents are interviewed in the home with parents or guardians present. One might assume that adolescents in these circumstances would underreport the number of sexual partners they have had; as a result, one might expect this systematic underreporting to represent a downward bias in the data collected.

Other forms of bias exist, such as selection bias (if study participants are systematically different from the target population, or population of interest). Selection bias, for instance, does not necessarily affect the internal validity of the study (the ability to collect valid data) but may affect the external validity of the study (can researchers truly generalize findings to the population of interest?). These forms of bias will be described in further detail elsewhere in this series.

Measurement error is study and model specific; it comes in many forms and the type of error affects the level or form of bias. In interpreting results and designing research, researchers need to be aware of potential measurement error, do what they can to minimize bias, and provide a thorough assessment of bias in presenting the limitations of their work.

VALIDITY AND RELIABILITY

Validity refers to the degree to which a measurement accurately represents the underlying concept being measured. Put simply, does the instrument measure what it is intended to measure? Researchers also need to consider whether a measurement instrument is valid for use within specific populations. For instance, in Hughes's study of caring among peer groups in undergraduate nursing populations, the author not only had to ensure that her instrument accurately measured caring among peer groups but also needed to verify that this measurement was accurate in nursing undergraduate populations. 5 In this instance, Hughes developed the survey explicitly with undergraduates in mind, making the second point easier to achieve.

Suppose, however, that a researcher wanted to use a version of the survey to gauge peer caring among nursing faculty. Would this be appropriate? Not without first assessing the validity of the survey within the new sample. The validity of an instrument can be assessed in several ways: by having the instrument reviewed by a content expert, comparing the instrument's results with those from an alternative assessment metric, assessing how well the instrument predicts current or future performance for the concept under consideration, and running a factor analysis (a statistical procedure that compares items or subscales within an instrument with each other and with the overall instrument outcome).

Reliability and validity go hand in hand. Reliability reflects the consistency of a measurement tool in reporting variable data. An instrument must be reliable to be valid. The reliability of an instrument can be gauged in three ways:

  • stability (the consistency of outcomes with repeated implementation)
  • interrater reliability (the consistency between different evaluators)
  • internal consistency (the homogeneity of items within a scale as they relate to the measurement of the concept under investigation)

Cronbach α is a statistical procedure to assess instrument reliability by determining the internal consistency of items on a multi-item scale. 4, 9 Internal consistency evaluations examine how closely items on a scale represent the outcome concept under evaluation. Cronbach α scores range from 0.00 to 1.00—the higher the score the better the internal consistency. An acceptable Cronbach α as an evaluation of instrument reliability is often considered to be 0.70; however, a score of 0.80 or higher is preferable.
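For readers curious about what this calculation involves, the sketch below computes Cronbach α for a small, invented matrix of item responses; in practice, researchers would rely on an established statistics package and a much larger sample.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of scores."""
    k = items.shape[1]                               # number of items on the scale
    item_variances = items.var(axis=0, ddof=1)       # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical data: 6 respondents answering a 4-item Likert-type scale.
scores = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [3, 2, 3, 3],
])

print(f"Cronbach alpha: {cronbach_alpha(scores):.2f}")
```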

When choosing a measurement instrument for quantitative research, it is best to select one that has documented validity and reliability; alternatively, the researcher may independently complete and describe an assessment of the instrument's validity and reliability. Evaluation of a research study prior to practice implementation should also include assessment of the validity and reliability of the measurement instrument employed, which should be described within the research article.

This installment of AJN 's nursing research series explores how to measure both research outcomes and factors that are hypothesized to influence outcomes. Careful selection of measurement instruments will enhance the accuracy of research and maximize the ability of the research findings to meaningfully inform nursing practice and improve the well-being of patient populations. The next article in this series will further explore the selection and utilization of measurement instruments in the design and execution of nursing research.



10.1 What is measurement?

Learning Objectives

Learners will be able to…

  • Define measurement
  • Explain where measurement fits into the process of designing research
  • Apply Kaplan’s three categories to determine the complexity of measuring a given variable

Pre-awareness check (Knowledge)

What do you already know about measuring key variables in your research topic?

In social science, when we use the term measurement , we mean the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena that we are investigating. In this chapter, we’ll use the term “concept” to mean an abstraction that has meaning. Concepts can be understood from our own experiences or from particular facts, but they don’t have to be limited to real-life phenomena. We can have a concept of anything we can imagine or experience, such as weightlessness, friendship, or income. Understanding exactly what our concepts mean is necessary in order to measure them.

In research, measurement is a systematic procedure for assigning scores, meanings, and descriptions to concepts so that those scores represent the characteristic of interest. Social scientists can and do measure just about anything you can imagine observing or wanting to study. Of course, some things are easier to observe or measure than others.

Where does measurement fit in the process of designing research?

Table 10.1 is intended as a partial review and outlines the general process researchers can follow to get from problem formulation to data collection, including measurement. Keep in mind that this process is iterative. For example, you may find something in your literature review that leads you to refine your conceptualizations, or you may discover as you attempt to conceptually define your terms that you need to return to the literature for further information. Accordingly, this table should be seen as a suggested path to take rather than an inflexible rule about how research must be conducted.

Table 10.1. Components of the Research Process from Problem Formulation to Data Collection. Note. Information on attachment theory in this table came from: Bowlby, J. (1978). Attachment theory and its therapeutic implications. Adolescent Psychiatry, 6 , 5-33

Categories of concepts that social scientists measure

In 1964, philosopher Abraham Kaplan (1964) [1] wrote The Conduct of Inquiry , which has been cited over 8,500 times. [2] In his text, Kaplan describes different categories of things that behavioral scientists observe. One of those categories, which Kaplan called “observational terms,” is probably the simplest to measure in social science. Observational terms are simple concepts. They are the sorts of things that we can see with the naked eye simply by looking at them. Kaplan roughly defines them as concepts that are easy to identify and verify through direct observation. If, for example, we wanted to know how the conditions of playgrounds differ across different neighborhoods, we could directly observe the variety, amount, and condition of equipment at various playgrounds.

Indirect observables , on the other hand, are less straightforward concepts to assess. In Kaplan’s framework, they are conditions that are subtle and complex that we must use existing knowledge and intuition to define. If we conducted a study for which we wished to know a person’s income, we’d probably have to ask them their income, perhaps in an interview or a survey. Thus, we have observed income, even if it has only been observed indirectly. Birthplace might be another indirect observable. We can ask study participants where they were born, but chances are good we won’t have directly observed any of those people being born in the locations they report.

Sometimes the concepts that we are interested in are more complex and more abstract than observational terms or indirect observables. Because they are complex, constructs generally consist of more than one concept. Let’s take, for example, the construct “bureaucracy.” We know this term has something to do with hierarchy, organizations, and how they operate, but measuring such a construct is trickier than measuring something like a person’s income because of the complexity involved. Here’s another construct: racism. What is racism? How would you measure it? Racism and bureaucracy are both constructs whose meanings we have come to agree on.

Though we may not be able to observe constructs directly, we can observe their components. In Kaplan’s categorization, constructs are concepts that are “not observational either directly or indirectly” (Kaplan, 1964, p. 55), [3] but they can be defined based on observables. An example would be measuring the construct of depression. A diagnosis of depression can be made using the DSM-5, which includes diagnostic criteria such as fatigue and poor concentration. Each of these components of depression can be observed indirectly. We are able to measure constructs by defining them in terms of what we can observe.

TRACK 1 (IF YOU ARE CREATING A RESEARCH PROPOSAL FOR THIS CLASS):

Look at the variables in your research question.

  • Classify them as direct observables, indirect observables, or constructs.
  • Do you think measuring them will be easy or hard?
  • What are your first thoughts about how to measure each variable? No wrong answers here, just write down a thought about each variable.

TRACK 2 (IF YOU AREN’T CREATING A RESEARCH PROPOSAL FOR THIS CLASS): 

You are interested in studying older adults’ social-emotional well-being. Specifically, you would like to research the impact on levels of older adult loneliness of an intervention that pairs older adults living in assisted living communities with university student volunteers for a weekly conversation.

Develop a working research question for this topic. Then, look at the variables in your research question.

  • Kaplan, A. (1964). The conduct of inquiry: Methodology for behavioral science. San Francisco, CA: Chandler Publishing Company.
  • Earl Babbie offers a more detailed discussion of Kaplan’s work in his text. You can read it in: Babbie, E. (2010). The practice of social research (12th ed.). Belmont, CA: Wadsworth.
  • Kaplan, A. (1964). The conduct of inquiry: Methodology for behavioral science. San Francisco, CA: Chandler Publishing Company.

Key terms

  • Measurement: the process by which we describe and ascribe meaning to the key facts, concepts, or other phenomena under investigation in a research study.
  • Observational terms (direct observables): conditions that are easy to identify and verify through direct observation.
  • Indirect observables: conditions that require subtle and complex observations to measure; existing knowledge and intuition may be needed to define them.
  • Constructs: conditions that are not directly observable and represent states of being, experiences, and ideas.

Doctoral Research Methods in Social Work Copyright © by Mavs Open Press. All Rights Reserved.


Review Article | Published: 01 June 2023

Data, measurement and empirical methods in the science of science

Lu Liu, Benjamin F. Jones (ORCID: 0000-0001-9697-9388), Brian Uzzi (ORCID: 0000-0001-6855-2854) & Dashun Wang (ORCID: 0000-0002-7054-2206)

Nature Human Behaviour, volume 7, pages 1046–1058 (2023)


The advent of large-scale datasets that trace the workings of science has encouraged researchers from many different disciplinary backgrounds to turn scientific methods into science itself, cultivating a rapidly expanding ‘science of science’. This Review considers this growing, multidisciplinary literature through the lens of data, measurement and empirical methods. We discuss the purposes, strengths and limitations of major empirical approaches, seeking to increase understanding of the field’s diverse methodologies and expand researchers’ toolkits. Overall, new empirical developments provide enormous capacity to test traditional beliefs and conceptual frameworks about science, discover factors associated with scientific productivity, predict scientific outcomes and design policies that facilitate scientific progress.


Scientific advances are a key input to rising standards of living, health and the capacity of society to confront grand challenges, from climate change to the COVID-19 pandemic 1 , 2 , 3 . A deeper understanding of how science works and where innovation occurs can help us to more effectively design science policy and science institutions, better inform scientists’ own research choices, and create and capture enormous value for science and humanity. Building on these key premises, recent years have witnessed substantial development in the ‘science of science’ 4 , 5 , 6 , 7 , 8 , 9 , which uses large-scale datasets and diverse computational toolkits to unearth fundamental patterns behind scientific production and use.

The idea of turning scientific methods into science itself is long-standing. Since the mid-20th century, researchers from different disciplines have asked central questions about the nature of scientific progress and the practice, organization and impact of scientific research. Building on these rich historical roots, the field of the science of science draws upon many disciplines, ranging from information science to the social, physical and biological sciences to computer science, engineering and design. The science of science closely relates to several strands and communities of research, including metascience, scientometrics, the economics of science, research on research, science and technology studies, the sociology of science, metaknowledge and quantitative science studies 5 . There are noticeable differences between some of these communities, mostly around their historical origins and the initial disciplinary composition of researchers forming these communities. For example, metascience has its origins in the clinical sciences and psychology, and focuses on rigour, transparency, reproducibility and other open science-related practices and topics. The scientometrics community, born in library and information sciences, places a particular emphasis on developing robust and responsible measures and indicators for science. Science and technology studies engage the history of science and technology, the philosophy of science, and the interplay between science, technology and society. The science of science, which has its origins in physics, computer science and sociology, takes a data-driven approach and emphasizes questions on how science works. Each of these communities has made fundamental contributions to understanding science. While they differ in their origins, these differences pale in comparison to the overarching, common interest in understanding the practice of science and its societal impact.

Three major developments have encouraged rapid advances in the science of science. The first is in data 9 : modern databases include millions of research articles, grant proposals, patents and more. This windfall of data traces scientific activity in remarkable detail and at scale. The second development is in measurement: scholars have used data to develop many new measures of scientific activities and examine theories that have long been viewed as important but difficult to quantify. The third development is in empirical methods: thanks to parallel advances in data science, network science, artificial intelligence and econometrics, researchers can study relationships, make predictions and assess science policy in powerful new ways. Together, new data, measurements and methods have revealed fundamental new insights about the inner workings of science and scientific progress itself.

With multiple approaches, however, comes a key challenge. As researchers adhere to norms respected within their disciplines, their methods vary, with results often published in venues with non-overlapping readership, fragmenting research along disciplinary boundaries. This fragmentation challenges researchers’ ability to appreciate and understand the value of work outside of their own discipline, much less to build directly on it for further investigations.

Recognizing these challenges and the rapidly developing nature of the field, this paper reviews the empirical approaches that are prevalent in this literature. We aim to provide readers with an up-to-date understanding of the available datasets, measurement constructs and empirical methodologies, as well as the value and limitations of each. Owing to space constraints, this Review does not cover the full technical details of each method, referring readers to related guides to learn more. Instead, we will emphasize why a researcher might favour one method over another, depending on the research question.

Beyond a positive understanding of science, a key goal of the science of science is to inform science policy. While this Review mainly focuses on empirical approaches, with its core audience being researchers in the field, the studies reviewed are also germane to key policy questions. For example, what is the appropriate scale of scientific investment, in what directions and through what institutions 10 , 11 ? Are public investments in science aligned with public interests 12 ? What conditions produce novel or high-impact science 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 ? How do the reward systems of science influence the rate and direction of progress 13 , 21 , 22 , 23 , 24 , and what governs scientific reproducibility 25 , 26 , 27 ? How do contributions evolve over a scientific career 28 , 29 , 30 , 31 , 32 , and how may diversity among scientists advance scientific progress 33 , 34 , 35 ? These are among the many questions relevant to science policy 36 , 37 .

Overall, this review aims to facilitate entry to science of science research, expand researcher toolkits and illustrate how diverse research approaches contribute to our collective understanding of science. Section 2 reviews datasets and data linkages. Section 3 reviews major measurement constructs in the science of science. Section 4 considers a range of empirical methods, focusing on one study to illustrate each method and briefly summarizing related examples and applications. Section 5 concludes with an outlook for the science of science.

Data

Historically, data on scientific activities were difficult to collect and were available in limited quantities. Gathering data could involve manually tallying statistics from publications 38 , 39 , interviewing scientists 16 , 40 , or assembling historical anecdotes and biographies 13 , 41 . Analyses were typically limited to a specific domain or group of scientists. Today, massive datasets on scientific production and use are at researchers’ fingertips 42 , 43 , 44 . Armed with big data and advanced algorithms, researchers can now probe questions previously not amenable to quantification and with enormous increases in scope and scale, as detailed below.

Publication datasets cover papers from nearly all scientific disciplines, enabling analyses of both general and domain-specific patterns. Commonly used datasets include the Web of Science (WoS), PubMed, CrossRef, ORCID, OpenCitations, Dimensions and OpenAlex. Datasets incorporating papers’ text (CORE) 45 , 46 , 47 , data entities (DataCite) 48 , 49 and peer review reports (Publons) 33 , 50 , 51 have also become available. These datasets further enable novel measurement, for example, representations of a paper’s content 52 , 53 , novelty 15 , 54 and interdisciplinarity 55 .

Notably, databases today capture more diverse aspects of science beyond publications, offering a richer and more encompassing view of research contexts and of researchers themselves (Fig. 1 ). For example, some datasets trace research funding to the specific publications these investments support 56 , 57 , allowing high-scale studies of the impact of funding on productivity and the return on public investment. Datasets incorporating job placements 58 , 59 , curriculum vitae 21 , 59 and scientific prizes 23 offer rich quantitative evidence on the social structure of science. Combining publication profiles with mentorship genealogies 60 , 61 , dissertations 34 and course syllabi 62 , 63 provides insights on mentoring and cultivating talent.

Figure 1. This figure presents commonly used data types in science of science research, information contained in each data type and examples of data sources. Datasets in the science of science research have not only grown in scale but have also expanded beyond publications to integrate upstream funding investments and downstream applications that extend beyond science itself.

Finally, today’s scope of data extends beyond science to broader aspects of society. Altmetrics 64 captures news media and social media mentions of scientific articles. Other databases incorporate marketplace uses of science, including through patents 10 , pharmaceutical clinical trials and drug approvals 65 , 66 . Policy documents 67 , 68 help us to understand the role of science in the halls of government 69 and policy making 12 , 68 .

While datasets of the modern scientific enterprise have grown exponentially, they are not without limitations. As is often the case for data-driven research, drawing conclusions from specific data sources requires scrutiny and care. Datasets are typically based on published work, which may favour easy-to-publish topics over important ones (the streetlight effect) 70 , 71 . The publication of negative results is also rare (the file drawer problem) 72 , 73 . Meanwhile, English language publications account for over 90% of articles in major data sources, with limited coverage of non-English journals 74 . Publication datasets may also reflect biases in data collection across research institutions or demographic groups. Despite the open science movement, many datasets require paid subscriptions, which can create inequality in data access. Creating more open datasets for the science of science, such as OpenAlex, may not only improve the robustness and replicability of empirical claims but also increase entry to the field.

As today’s datasets become larger in scale and continue to integrate new dimensions, they offer opportunities to unveil the inner workings and external impacts of science in new ways. They can enable researchers to reach beyond previous limitations while conducting original studies of new and long-standing questions about the sciences.

Measurement

Here we discuss prominent measurement approaches in the science of science, including their purposes and limitations.

Citations

Modern publication databases typically include data on which articles and authors cite other papers and scientists. These citation linkages have been used to engage core conceptual ideas in scientific research. Here we consider two common measures based on citation information: citation counts and knowledge flows.

First, citation counts are commonly used indicators of impact. The term ‘indicator’ implies that it only approximates the concept of interest. A citation count is defined as how many times a document is cited by subsequent documents and can proxy for the importance of research papers 75 , 76 as well as patented inventions 77 , 78 , 79 . Rather than treating each citation equally, measures may further weight the importance of each citation, for example by using the citation network structure to produce centrality 80 , PageRank 81 , 82 or Eigenfactor indicators 83 , 84 .
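To make these indicators concrete, the sketch below computes raw citation counts (in-degree) and PageRank scores for a tiny, made-up citation network using the networkx library; the paper names are placeholders.

```python
import networkx as nx

# Hypothetical citation network: an edge (A, B) means paper A cites paper B.
citation_edges = [
    ("paper_B", "paper_A"),
    ("paper_C", "paper_A"),
    ("paper_D", "paper_A"),
    ("paper_D", "paper_B"),
    ("paper_E", "paper_C"),
]

G = nx.DiGraph(citation_edges)

citation_counts = dict(G.in_degree())         # raw citation counts
pagerank_scores = nx.pagerank(G, alpha=0.85)  # citations weighted by the citing paper's importance

for paper in sorted(G.nodes()):
    print(paper, citation_counts[paper], round(pagerank_scores[paper], 3))
```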

Citation-based indicators have also faced criticism 84 , 85 . Citation indicators necessarily oversimplify the construct of impact, often ignoring heterogeneity in the meaning and use of a particular reference, the variations in citation practices across fields and institutional contexts, and the potential for reputation and power structures in science to influence citation behaviour 86 , 87 . Researchers have started to understand more nuanced citation behaviours ranging from negative citations 86 to citation context 47 , 88 , 89 . Understanding what a citation actually measures matters in interpreting and applying many research findings in the science of science. Evaluations relying on citation-based indicators rather than expert judgements raise questions regarding misuse 90 , 91 , 92 . Given the importance of developing indicators that can reliably quantify and evaluate science, the scientometrics community has been working to provide guidance for responsible citation practices and assessment 85 .

Second, scientists use citations to trace knowledge flows. Each citation in a paper is a link to specific previous work from which we can proxy how new discoveries draw upon existing ideas 76 , 93 and how knowledge flows between fields of science 94 , 95 , research institutions 96 , regions and nations 97 , 98 , 99 , and individuals 81 . Combinations of citation linkages can also approximate novelty 15 , disruptiveness 17 , 100 and interdisciplinarity 55 , 95 , 101 , 102 . A rapidly expanding body of work further examines citations to scientific articles from other domains (for example, patents, clinical drug trials and policy documents) to understand the applied value of science 10 , 12 , 65 , 66 , 103 , 104 , 105 .

Individuals

Analysing individual careers allows researchers to answer questions such as: How do we quantify individual scientific productivity? What is a typical career lifecycle? How are resources and credits allocated across individuals and careers? A scholar’s career can be examined through the papers they publish 30 , 31 , 106 , 107 , 108 , with attention to career progression and mobility, publication counts and citation impact, as well as grant funding 24 , 109 , 110 and prizes 111 , 112 , 113 .

Studies of individual impact focus on output, typically approximated by the number of papers a researcher publishes and citation indicators. A popular measure for individual impact is the h -index 114 , which takes both volume and per-paper impact into consideration. Specifically, a scientist is assigned the largest value h such that they have h papers that were each cited at least h times. Later studies build on the idea of the h -index and propose variants to address its limitations 115 , ranging from emphasizing highly cited papers in a career 116 , to accounting for field differences 117 and normalizations 118 , to weighing the relative contribution of an individual in collaborative works 119 .
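As a minimal illustration of this definition, the sketch below computes the h-index from a hypothetical list of per-paper citation counts.

```python
def h_index(citation_counts: list[int]) -> int:
    """Largest h such that the researcher has h papers each cited at least h times."""
    h = 0
    for rank, cites in enumerate(sorted(citation_counts, reverse=True), start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for one researcher's papers.
print(h_index([10, 8, 5, 4, 3, 0]))  # 4: four papers each have at least 4 citations
```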

To study dynamics in output over the lifecycle, individuals can be studied according to age, career age or the sequence of publications. A long-standing literature has investigated the relationship between age and the likelihood of outstanding achievement 28 , 106 , 111 , 120 , 121 . Recent studies further decouple the relationship between age, publication volume and per-paper citation, and measure the likelihood of producing highly cited papers in the sequence of works one produces 30 , 31 .

As simple as it sounds, representing careers using publication records is difficult. Collecting the full publication list of a researcher is the foundation for studying individuals, yet it remains a key challenge, requiring name disambiguation techniques to match specific works to specific researchers. Although algorithms are increasingly capable of identifying millions of career profiles 122 , they vary in accuracy and robustness. ORCID can help to alleviate the problem by offering researchers the opportunity to create, maintain and update individual profiles themselves, and it goes beyond publications to collect broader outputs and activities 123 . A second challenge is survivorship bias. Empirical studies tend to focus on careers that are long enough to afford statistical analyses, which limits the applicability of the findings to scientific careers as a whole. A third challenge is the breadth of scientists’ activities, where focusing on publications ignores other important contributions such as mentorship and teaching, service (for example, refereeing papers, reviewing grant proposals and editing journals) or leadership within their organizations. Although researchers have begun exploring these dimensions by linking individual publication profiles with genealogical databases 61 , 124 , dissertations 34 , grants 109 , curriculum vitae 21 and acknowledgements 125 , scientific careers beyond publication records remain under-studied 126 , 127 . Lastly, citation-based indicators only serve as an approximation of individual performance with similar limitations as discussed above. The scientific community has called for more appropriate practices 85 , 128 , ranging from incorporating expert assessment of research contributions to broadening the measures of impact beyond publications.

Teams

Over many decades, science has exhibited a substantial and steady shift away from solo authorship towards coauthorship, especially among highly cited works 18 , 129 , 130 . In light of this shift, a research field, the science of team science 131 , 132 , has emerged to study the mechanisms that facilitate or hinder the effectiveness of teams. Team size can be proxied by the number of coauthors on a paper, which has been shown to predict distinctive types of advance: whereas larger teams tend to develop ideas, smaller teams tend to disrupt current ways of thinking 17 . Team characteristics can be inferred from coauthors’ backgrounds 133 , 134 , 135 , allowing quantification of a team’s diversity in terms of field, age, gender or ethnicity. Collaboration networks based on coauthorship 130 , 136 , 137 , 138 , 139 offer nuanced network-based indicators to understand individual and institutional collaborations.

However, there are limitations to using coauthorship alone to study teams 132 . First, coauthorship can obscure individual roles 140 , 141 , 142 , which has prompted institutional responses to help to allocate credit, including authorship order and individual contribution statements 56 , 143 . Second, coauthorship does not reflect the complex dynamics and interactions between team members that are often instrumental for team success 53 , 144 . Third, collaborative contributions can extend beyond coauthorship in publications to include members of a research laboratory 145 or co-principal investigators (co-PIs) on a grant 146 . Initiatives such as CRediT may help to address some of these issues by recording detailed roles for each contributor 147 .

Institutions

Research institutions, such as departments, universities, national laboratories and firms, encompass wider groups of researchers and their corresponding outputs. Institutional membership can be inferred from affiliations listed on publications or patents 148 , 149 , and the output of an institution can be aggregated over all its affiliated researchers 150 . Institutional research information systems (CRIS) contain more comprehensive research outputs and activities from employees.

Some research questions consider the institution as a whole, investigating the returns to research and development investment 104 , inequality of resource allocation 22 and the flow of scientists 21 , 148 , 149 . Other questions focus on institutional structures as sources of research productivity by looking into the role of peer effects 125 , 151 , 152 , 153 , how institutional policies impact research outcomes 154 , 155 and whether interdisciplinary efforts foster innovation 55 . Institution-oriented measurement faces similar limitations as with analyses of individuals and teams, including name disambiguation for a given institution and the limited capacity of formal publication records to characterize the full range of relevant institutional outcomes. It is also unclear how to allocate credit among multiple institutions associated with a paper. Moreover, relevant institutional employees extend beyond publishing researchers: interns, technicians and administrators all contribute to research endeavours 130 .

In sum, measurements allow researchers to quantify scientific production and use across numerous dimensions, but they also raise questions of construct validity: Does the proposed metric really reflect what we want to measure? Testing the construct’s validity is important, as is understanding a construct’s limits. Where possible, using alternative measurement approaches, or qualitative methods such as interviews and surveys, can improve measurement accuracy and the robustness of findings.

Empirical methods

In this section, we review two broad categories of empirical approaches (Table 1 ), each with distinctive goals: (1) to discover, estimate and predict empirical regularities; and (2) to identify causal mechanisms. For each method, we give a concrete example to help to explain how the method works, summarize related work for interested readers, and discuss contributions and limitations.

Descriptive and predictive approaches

Empirical regularities and generalizable facts

The discovery of empirical regularities in science has had a key role in driving conceptual developments and the directions of future research. By observing empirical patterns at scale, researchers unveil central facts that shape science and present core features that theories of scientific progress and practice must explain. For example, consider citation distributions. de Solla Price first proposed that citation distributions are fat-tailed 39 , indicating that a few papers have extremely high citations while most papers have relatively few or even no citations at all. de Solla Price proposed that citation distribution was a power law, while researchers have since refined this view to show that the distribution appears log-normal, a nearly universal regularity across time and fields 156 , 157 . The fat-tailed nature of citation distributions and its universality across the sciences has in turn sparked substantial theoretical work that seeks to explain this key empirical regularity 20 , 156 , 158 , 159 .
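One simple way to probe this regularity is to fit a log-normal distribution to a sample of citation counts. The sketch below uses scipy on synthetic counts that merely stand in for data from a bibliographic source such as WoS or OpenAlex.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for per-paper citation counts (real data would come from a
# bibliographic database).
citations = np.round(rng.lognormal(mean=2.0, sigma=1.2, size=50_000)).astype(int)

# Fit a log-normal to the nonzero counts (zero-cited papers need separate treatment).
nonzero = citations[citations > 0]
shape, loc, scale = stats.lognorm.fit(nonzero, floc=0)

print(f"Fitted sigma: {shape:.2f}, fitted median: {scale:.1f} citations")
print(f"Share of papers above the mean: {(nonzero > nonzero.mean()).mean():.1%}")
# A fat tail shows up as a small share of papers sitting above the mean while
# collecting a disproportionate share of all citations.
```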

Empirical regularities are often surprising and can contest previous beliefs of how science works. For example, it has been shown that the age distribution of great achievements peaks in middle age across a wide range of fields 107 , 121 , 160 , rejecting the common belief that young scientists typically drive breakthroughs in science. A closer look at the individual careers also indicates that productivity patterns vary widely across individuals 29 . Further, a scholar’s highest-impact papers come at a remarkably constant rate across the sequence of their work 30 , 31 .

The discovery of empirical regularities has had important roles in shaping beliefs about the nature of science 10 , 45 , 161 , 162 , sources of breakthrough ideas 15 , 163 , 164 , 165 , scientific careers 21 , 29 , 126 , 127 , the network structure of ideas and scientists 23 , 98 , 136 , 137 , 138 , 139 , 166 , gender inequality 57 , 108 , 126 , 135 , 143 , 167 , 168 , and many other areas of interest to scientists and science institutions 22 , 47 , 86 , 97 , 102 , 105 , 134 , 169 , 170 , 171 . At the same time, care must be taken to ensure that findings are not merely artefacts due to data selection or inherent bias. To differentiate meaningful patterns from spurious ones, it is important to stress test the findings through different selection criteria or across non-overlapping data sources.

Regression analysis

When investigating correlations among variables, a classic method is regression, which estimates how one set of variables explains variation in an outcome of interest. Regression can be used to test explicit hypotheses or predict outcomes. For example, researchers have investigated whether a paper’s novelty predicts its citation impact 172 . Adding additional control variables to the regression, one can further examine the robustness of the focal relationship.

Although regression analysis is useful for hypothesis testing, it bears substantial limitations. If the question one wishes to ask concerns a ‘causal’ rather than a correlational relationship, regression is poorly suited to the task as it is impossible to control for all the confounding factors. Failing to account for such ‘omitted variables’ can bias the regression coefficient estimates and lead to spurious interpretations. Further, regression models often have low goodness of fit (small R 2 ), indicating that the variables considered explain little of the outcome variation. As regressions typically focus on a specific relationship in simple functional forms, regressions tend to emphasize interpretability rather than overall predictability. The advent of predictive approaches powered by large-scale datasets and novel computational techniques offers new opportunities for modelling complex relationships with stronger predictive power.
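The sketch below shows the general shape of such a regression, relating (log) citations to a novelty score with controls for team size and publication-year fixed effects. It uses statsmodels on synthetic data and is not a reproduction of any particular published specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 5_000

# Synthetic paper-level data standing in for a bibliometric sample.
papers = pd.DataFrame({
    "novelty": rng.normal(0, 1, n),
    "team_size": rng.integers(1, 10, n),
    "year": rng.integers(2000, 2020, n),
})
papers["log_citations"] = (
    0.3 * papers["novelty"] + 0.1 * papers["team_size"] + rng.normal(0, 1, n)
)

# OLS: does novelty predict citation impact, holding team size and year fixed?
model = smf.ols("log_citations ~ novelty + team_size + C(year)", data=papers).fit()
print(model.params["novelty"], model.bse["novelty"])  # coefficient and standard error
```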

Mechanistic models

Mechanistic modelling is an important approach to explaining empirical regularities, drawing from methods primarily used in physics. Such models predict macro-level regularities of a system by modelling micro-level interactions among basic elements with interpretable and modifiable formulas. While theoretical by nature, mechanistic models in the science of science are often empirically grounded, and this approach has developed together with the advent of large-scale, high-resolution data.

Simplicity is the core value of a mechanistic model. Consider for example, why citations follow a fat-tailed distribution. de Solla Price modelled the citing behaviour as a cumulative advantage process on a growing citation network 159 and found that if the probability a paper is cited grows linearly with its existing citations, the resulting distribution would follow a power law, broadly aligned with empirical observations. The model is intentionally simplified, ignoring myriad factors. Yet the simple cumulative advantage process is by itself sufficient in explaining a power law distribution of citations. In this way, mechanistic models can help to reveal key mechanisms that can explain observed patterns.
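A minimal simulation of this cumulative advantage process is sketched below; the number of papers and references per paper are arbitrary illustrative choices, and drawing cited papers with replacement is a simplification.

```python
import random

random.seed(7)

n_papers = 5_000
refs_per_paper = 5
citations = [0] * n_papers  # citations received by each paper so far

for new_paper in range(1, n_papers):
    # Probability of being cited grows linearly with citations already received;
    # the "+1" keeps papers with zero citations citable.
    weights = [citations[p] + 1 for p in range(new_paper)]
    cited = random.choices(range(new_paper), weights=weights,
                           k=min(refs_per_paper, new_paper))
    for p in cited:
        citations[p] += 1

top_1pct = sorted(citations, reverse=True)[: n_papers // 100]
print(f"Top 1% of papers receive {sum(top_1pct) / sum(citations):.0%} of all citations")
```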

Moreover, mechanistic models can be refined as empirical evidence evolves. For example, later investigations showed that citation distributions are better characterized as log-normal 156 , 173 , prompting researchers to introduce a fitness parameter to encapsulate the inherent differences in papers’ ability to attract citations 174 , 175 . Further, older papers are less likely to be cited than expected 176 , 177 , 178 , motivating more recent models 20 to introduce an additional aging effect 179 . By combining the cumulative advantage, fitness and aging effects, one can already achieve substantial predictive power not just for the overall properties of the system but also the citation dynamics of individual papers 20 .

In addition to citations, mechanistic models have been developed to understand the formation of collaborations 136 , 180 , 181 , 182 , 183 , knowledge discovery and diffusion 184 , 185 , topic selection 186 , 187 , career dynamics 30 , 31 , 188 , 189 , the growth of scientific fields 190 and the dynamics of failure in science and other domains 178 .

At the same time, some observers have argued that mechanistic models are too simplistic to capture the essence of complex real-world problems 191 . While it has been a cornerstone for the natural sciences, representing social phenomena in a limited set of mathematical equations may miss complexities and heterogeneities that make social phenomena interesting in the first place. Such concerns are not unique to the science of science, as they represent a broader theme in computational social sciences 192 , 193 , ranging from social networks 194 , 195 to human mobility 196 , 197 to epidemics 198 , 199 . Other observers have questioned the practical utility of mechanistic models and whether they can be used to guide decisions and devise actionable policies. Nevertheless, despite these limitations, several complex phenomena in the science of science are well captured by simple mechanistic models, showing a high degree of regularity beneath complex interacting systems and providing powerful insights about the nature of science. Mixing such modelling with other methods could be particularly fruitful in future investigations.

Machine learning

The science of science seeks in part to forecast promising directions for scientific research 7 , 44 . In recent years, machine learning methods have substantially advanced predictive capabilities 200 , 201 and are playing increasingly important parts in the science of science. In contrast to the previous methods, machine learning does not emphasize hypotheses or theories. Rather, it leverages complex relationships in data and optimizes goodness of fit to make predictions and categorizations.

Traditional machine learning models include supervised, semi-supervised and unsupervised learning. The model choice depends on data availability and the research question, ranging from supervised models for citation prediction 202 , 203 to unsupervised models for community detection 204 . Take for example mappings of scientific knowledge 94 , 205 , 206 . The unsupervised method applies network clustering algorithms to map the structures of science. Related visualization tools make sense of clusters from the underlying network, allowing observers to see the organization, interactions and evolution of scientific knowledge. More recently, supervised learning, and deep neural networks in particular, have witnessed especially rapid developments 207 . Neural networks can generate high-dimensional representations of unstructured data such as images and texts, which encode complex properties difficult for human experts to perceive.

Take text analysis as an example. A recent study 52 utilized 3.3 million paper abstracts in materials science to predict the thermoelectric properties of materials. The intuition is that the words currently used to describe a material may predict its hitherto undiscovered properties (Fig. 2 ). Compared with a random material, the materials predicted by the model are eight times more likely to be reported as thermoelectric within the next 5 years, suggesting that machine learning has the potential to substantially speed up knowledge discovery, especially as data continue to grow in scale and scope. Indeed, predicting the direction of new discoveries represents one of the most promising avenues for machine learning models, with neural networks being applied widely to biology 208 , physics 209 , 210 , mathematics 211 , chemistry 212 , medicine 213 and clinical applications 214 . Neural networks also offer a quantitative framework to probe the characteristics of creative products ranging from scientific papers 53 , journals 215 and organizations 148 to paintings and movies 32 . Neural networks can also help to predict the reproducibility of papers from a variety of disciplines at scale 53 , 216 .

Figure 2

This figure illustrates the word2vec skip-gram methods 52 , where the goal is to predict useful properties of materials using previous scientific literature. a , The architecture and training process of the word2vec skip-gram model, where the 3-layer, fully connected neural network learns the 200-dimensional representation (hidden layer) from the sparse vector for each word and its context in the literature (input layer). b , The top two principal components of the word embedding. Materials with similar features are close in the 2D space, allowing prediction of a material’s properties. Different targeted words are shown in different colours. Reproduced with permission from ref. 52 , Springer Nature Ltd.
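
The following sketch shows, in highly simplified form, how such an embedding pipeline can be assembled with an off-the-shelf skip-gram implementation (gensim's Word2Vec is assumed here, version 4 or later). The three 'abstracts' are placeholders; the study cited above trained on roughly 3.3 million abstracts with extensive domain-specific preprocessing, so the output here is only meant to show the mechanics of ranking words by embedding similarity.

from gensim.models import Word2Vec  # assumes gensim >= 4.x

# Placeholder "abstracts": in the cited study these are millions of
# materials-science abstracts, carefully tokenized.
abstracts = [
    "Bi2Te3 exhibits a high Seebeck coefficient and low thermal conductivity".lower().split(),
    "PbTe is a well known thermoelectric material with high figure of merit".lower().split(),
    "graphene shows exceptional electrical conductivity and mechanical strength".lower().split(),
]

# Skip-gram (sg=1) embedding with a 200-dimensional hidden layer
model = Word2Vec(abstracts, vector_size=200, window=5, sg=1, min_count=1, epochs=50, seed=1)

# Candidate materials can then be ranked by cosine similarity to property
# words such as 'thermoelectric'; with a realistic corpus this ranking
# surfaces predictions of undiscovered properties.
print(model.wv.most_similar(positive=["thermoelectric"], topn=3))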

While machine learning can offer high predictive accuracy, successful applications to the science of science face challenges, particularly regarding interpretability. Researchers may value transparent and interpretable findings for how a given feature influences an outcome, rather than a black-box model. The lack of interpretability also raises concerns about bias and fairness. In predicting reproducible patterns from data, machine learning models inevitably include and reproduce biases embedded in these data, often in non-transparent ways. The fairness of machine learning 217 is heavily debated in applications ranging from the criminal justice system to hiring processes. Effective and responsible use of machine learning in the science of science therefore requires thoughtful partnership between humans and machines 53 to build a reliable system accessible to scrutiny and modification.

Causal approaches

The preceding methods can reveal core facts about the workings of science and develop predictive capacity. Yet, they fail to capture causal relationships, which are particularly useful in assessing policy interventions. For example, how can we test whether a science policy boosts or hinders the performance of individuals, teams or institutions? The overarching idea of causal approaches is to construct some counterfactual world where two groups are identical to each other except that one group experiences a treatment that the other group does not.

Towards causation

Before engaging in causal approaches, it is useful to first consider the interpretative challenges of observational data. As observational data emerge from mechanisms that are not fully known or measured, an observed correlation may be driven by underlying forces that were not accounted for in the analysis. This challenge makes causal inference fundamentally difficult in observational data. An awareness of this issue is the first step in confronting it. It further motivates intermediate empirical approaches, including the use of matching strategies and fixed effects, that can help to confront (although not fully eliminate) the inference challenge. We first consider these approaches before turning to more fully causal methods.

Matching. Matching utilizes rich information to construct a control group that is similar to the treatment group on as many observable characteristics as possible before the treatment group is exposed to the treatment. Inferences can then be made by comparing the treatment and the matched control groups. Exact matching applies to categorical values, such as country, gender, discipline or affiliation 35 , 218 . Coarsened exact matching considers percentile bins of continuous variables and matches observations in the same bin 133 . Propensity score matching estimates the probability of receiving the ‘treatment’ on the basis of the controlled variables and uses the estimates to match treatment and control groups, which reduces the matching task from comparing the values of multiple covariates to comparing a single value 24 , 219 . Dynamic matching is useful for longitudinally matching variables that change over time 220 , 221 .
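
A minimal propensity-score-matching sketch, on simulated data, might look as follows; the covariates, the 'grant' treatment and the effect size are all invented for illustration, and real applications involve richer covariate sets, caliper choices and balance diagnostics.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 2000

# Simulated observational data: seniority and field size influence both the
# 'treatment' (say, receiving a grant) and the outcome (later citations).
seniority = rng.normal(size=n)
field_size = rng.normal(size=n)
treat = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * seniority + 0.5 * field_size))))
outcome = 2.0 * treat + 1.5 * seniority + 1.0 * field_size + rng.normal(size=n)

df = pd.DataFrame({"seniority": seniority, "field_size": field_size,
                   "treat": treat, "outcome": outcome})

# 1) Estimate propensity scores from the observed covariates
X = df[["seniority", "field_size"]]
df["ps"] = LogisticRegression().fit(X, df["treat"]).predict_proba(X)[:, 1]

# 2) Match each treated unit to the control unit with the nearest propensity
treated, control = df[df.treat == 1], df[df.treat == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
_, idx = nn.kneighbors(treated[["ps"]])
matched_control = control.iloc[idx.ravel()]

# 3) Compare outcomes across matched groups (the simulated true effect is 2.0)
att = treated["outcome"].mean() - matched_control["outcome"].mean()
print(f"estimated effect after matching: {att:.2f}")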

Fixed effects. Fixed effects are a powerful and now standard tool for controlling for confounders. A key requirement for using fixed effects is that there are multiple observations on the same subject or entity (person, field, institution and so on) 222 , 223 , 224 . The fixed effect works as a dummy variable that accounts for any fixed characteristic of that entity. Consider the finding that gender-diverse teams produce higher-impact papers than same-gender teams do 225 . A potential confounder is that individuals who tend to write high-impact papers may also be more likely to work in gender-diverse teams. By including individual fixed effects, one accounts for any fixed characteristics of individuals (such as IQ, cultural background or previous education) that might drive the relationship of interest.
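
The sketch below illustrates the same logic on simulated panel data: an unobserved, time-invariant 'ability' inflates the naive estimate, while author fixed effects (entered here as dummy variables via statsmodels) pull the estimate back toward the true value. The data-generating numbers are assumptions made purely for illustration.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_authors, papers_each = 200, 10

# Simulated panel: each author has a fixed 'ability' that raises both the
# chance of working in a gender-diverse team and the paper's impact.
author = np.repeat(np.arange(n_authors), papers_each)
ability = np.repeat(rng.normal(size=n_authors), papers_each)
diverse = rng.binomial(1, 1 / (1 + np.exp(-ability)))
impact = 0.5 * diverse + 1.0 * ability + rng.normal(size=author.size)

df = pd.DataFrame({"author": author, "diverse": diverse, "impact": impact})

# Naive pooled regression: biased upward because ability is omitted
print(smf.ols("impact ~ diverse", data=df).fit().params["diverse"])

# Author fixed effects (one dummy per author) absorb any time-invariant
# characteristic of the individual; the estimate moves toward the true 0.5
fe = smf.ols("impact ~ diverse + C(author)", data=df).fit()
print(fe.params["diverse"])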

In sum, matching and fixed-effects methods reduce potential sources of bias in interpreting relationships between variables. Yet confounders may persist in these studies. For instance, fixed effects do not control for unobserved factors that change with time within the given entity (for example, access to funding or new skills). Identifying causal effects convincingly then typically requires the distinct research methods to which we turn next.

Quasi-experiments

Researchers in economics and other fields have developed a range of quasi-experimental methods to construct treatment and control groups. The key idea here is exploiting randomness from external events that differentially expose subjects to a particular treatment. Here we review three quasi-experimental methods: difference-in-differences, instrumental variables and regression discontinuity (Fig. 3 ).

Figure 3

a – c , Illustrations of ( a ) difference-in-differences, ( b ) instrumental variables and ( c ) regression discontinuity methods. The solid lines in b represent causal links, and the dashed line represents a relationship that must be absent if the IV method is to produce valid causal inference.

Difference-in-differences. Difference-in-differences (DiD) regression investigates the effect of an unexpected event by comparing the affected group (the treated group) with an unaffected group (the control group). The control group is intended to provide the counterfactual path—what would have happened were it not for the unexpected event. Ideally, the treated and control groups are on virtually identical paths before the treatment event, but DiD can also work if the groups are on parallel paths (Fig. 3a ). For example, one study 226 examines how the premature death of superstar scientists affects the productivity of their previous collaborators. The control group consists of collaborators of superstars who did not die in the time frame. The two groups do not show significant differences in publications before a death event, yet upon the death of a star scientist, the treated collaborators on average experience a 5–8% decline in their quality-adjusted publication rates compared with the control group. DiD has wide applicability in the science of science, having been used to analyse the causal effects of grant design 24 , access costs to previous research 155 , 227 , university technology transfer policies 154 , intellectual property 228 , citation practices 229 , evolution of fields 221 and the impacts of paper retractions 230 , 231 , 232 . The DiD literature has grown especially rapidly in the field of economics, with substantial recent refinements 233 , 234 .
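
A stripped-down two-period DiD regression on simulated data is sketched below; the group sizes, trends and the −0.6 'loss of a collaborator' effect are invented, and modern applications with staggered treatment timing require the refinements cited above.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 500  # scientists per group-period cell

# Simulated collaborator-productivity data: treated scientists lose a
# superstar collaborator between the two periods; the true effect is -0.6.
def cell(treated, post, effect):
    base = 5.0 + 0.8 * post            # common time trend for both groups
    level = 0.3 * treated              # fixed level difference between groups
    y = base + level + effect * treated * post + rng.normal(0, 1, n)
    return pd.DataFrame({"y": y, "treated": treated, "post": post})

df = pd.concat([cell(t, p, effect=-0.6) for t in (0, 1) for p in (0, 1)],
               ignore_index=True)

# DiD estimate = coefficient on the treated x post interaction
m = smf.ols("y ~ treated * post", data=df).fit()
print(m.params["treated:post"])  # close to -0.6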

Instrumental variables. Another quasi-experimental approach utilizes ‘instrumental variables’ (IV). The goal is to determine the causal influence of some feature X on some outcome Y by using a third, instrumental variable. This instrumental variable is a quasi-random event that induces variation in X and, except for its impact through X , has no other effect on the outcome Y (Fig. 3b ). For example, consider a study of astronomy that seeks to understand how telescope time affects career advancement 235 . Here, one cannot simply look at the correlation between telescope time and career outcomes because many confounders (such as talent or grit) may influence both telescope time and career opportunities. Now consider the weather as an instrumental variable. Cloudy weather will, at random, reduce an astronomer’s observational time, yet the weather on particular nights is unlikely to correlate with a scientist’s innate qualities. The weather can therefore serve as an instrumental variable to reveal a causal relationship between telescope time and career outcomes. Instrumental variables have been used to study local peer effects in research 151 , the impact of gender composition in scientific committees 236 , the effect of patents on future innovation 237 and the effect of taxes on inventor mobility 238 .
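
The following sketch mimics the telescope-time example with simulated data and a hand-rolled two-stage least squares: cloudy nights shift telescope time but are unrelated to talent, so the second-stage coefficient recovers the true effect that naive OLS overstates. All effect sizes are invented, and a dedicated IV routine would be needed for valid standard errors.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000

# Simulated astronomy example: unobserved 'talent' raises both telescope
# time and career outcomes; cloudy nights shift telescope time at random.
talent = rng.normal(size=n)
cloudy = rng.binomial(1, 0.4, size=n)            # the instrument Z
telescope = 1.0 + 0.7 * talent - 0.8 * cloudy + rng.normal(size=n)
career = 2.0 + 0.5 * telescope + 1.2 * talent + rng.normal(size=n)

# Naive OLS is biased upward because talent is omitted
print(sm.OLS(career, sm.add_constant(telescope)).fit().params[1])

# Two-stage least squares by hand (coefficients only; a dedicated IV
# routine is needed for correct standard errors)
stage1 = sm.OLS(telescope, sm.add_constant(cloudy)).fit()
stage2 = sm.OLS(career, sm.add_constant(stage1.fittedvalues)).fit()
print(stage2.params[1])  # close to the true 0.5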

Regression discontinuity. In regression discontinuity, policies with an arbitrary threshold for receiving some benefit can be used to construct treatment and control groups (Fig. 3c ). Take the funding paylines for grant proposals as an example. Proposals with scores increasingly close to the payline are increasingly similar in both their observable and unobservable characteristics, yet only those projects with scores above the payline receive funding. For example, a study 110 examines the effect of winning an early-career grant on the probability of winning a later, mid-career grant. The probability has a discontinuous jump across the initial grant’s payline, providing the treatment and control groups needed to estimate the causal effect of receiving a grant. This example utilizes a ‘sharp’ regression discontinuity, which assumes that treatment status is fully determined by the cut-off. If treatment status is only partly determined by the cut-off, one can use a ‘fuzzy’ regression discontinuity design, in which the cut-off-induced shift in the probability of receiving a grant is used to estimate the effect on the future outcome 11 , 110 , 239 , 240 , 241 .
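
A minimal sharp-RD sketch on simulated data follows: proposals are funded exactly when the (centred) score crosses zero, and a local linear regression with separate slopes on either side of the cut-off estimates the jump. The bandwidth and effect size are arbitrary choices for illustration; applied work typically uses data-driven bandwidths and extensive robustness checks.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 4000

# Simulated grant scores centred at the payline (0); proposals at or above
# it are funded. The simulated effect of funding on the later outcome is 1.0.
score = rng.uniform(-1, 1, n)
funded = (score >= 0).astype(int)
outcome = 2.0 + 1.5 * score + 1.0 * funded + rng.normal(0, 1, n)
df = pd.DataFrame({"score": score, "funded": funded, "outcome": outcome})

# Sharp RD: local linear fit within a bandwidth around the cut-off, with
# separate slopes on either side; the jump at 0 is the causal estimate.
bw = 0.3
local = df[df.score.abs() <= bw]
m = smf.ols("outcome ~ funded + score + funded:score", data=local).fit()
print(m.params["funded"])  # close to 1.0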

Although quasi-experiments are powerful tools, they face their own limitations. First, these approaches identify causal effects within a specific context and often engage small numbers of observations. How representative the samples are of broader populations or contexts is typically left as an open question. Second, the validity of the causal design is typically not ironclad. Researchers usually conduct robustness checks to verify that observable confounders do not differ significantly between the treated and control groups before treatment. However, unobservable features may still differ between treatment and control groups. The quality of an instrumental variable, and the specific claim that it has no effect on the outcome except through the variable of interest, is also difficult to assess. Ultimately, researchers must rely partly on judgement to tell whether appropriate conditions for causal inference are met.

This section emphasized popular econometric approaches to causal inference. Other empirical approaches, such as graphical causal modelling 242 , 243 , also represent an important stream of work on assessing causal relationships. Such approaches usually represent causation as a directed acyclic graph, with nodes as variables and arrows between them as suspected causal relationships. In the science of science, the directed acyclic graph approach has been applied to quantify the causal effect of journal impact factor 244 and of gender or racial bias 245 on citations. Graphical causal modelling has also triggered discussions of its strengths and weaknesses compared with econometric methods 246 , 247 .
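
The core 'backdoor adjustment' idea behind such graphs can be illustrated with a small simulation: under an assumed DAG in which prestige drives both journal placement and citations, conditioning on the confounder recovers the causal coefficient that the unadjusted regression overstates. The DAG and all coefficients here are invented for illustration, not estimates from the cited studies.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 10_000

# Assumed DAG (illustrative): prestige -> journal, prestige -> citations,
# journal -> citations. 'prestige' sits on the backdoor path to adjust for.
prestige = rng.normal(size=n)
journal = rng.binomial(1, 1 / (1 + np.exp(-1.5 * prestige)))
citations = 3.0 + 0.4 * journal + 1.0 * prestige + rng.normal(size=n)

X_naive = sm.add_constant(journal)
X_adjusted = sm.add_constant(np.column_stack([journal, prestige]))

print(sm.OLS(citations, X_naive).fit().params[1])     # inflated by the open backdoor path
print(sm.OLS(citations, X_adjusted).fit().params[1])  # close to the true 0.4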

Experiments

In contrast to quasi-experimental approaches, laboratory and field experiments conduct direct randomization in assigning treatment and control groups. These methods engage explicitly in the data generation process, manipulating interventions to observe counterfactuals. These experiments are crafted to study mechanisms of specific interest and, by designing the experiment and formally randomizing, can produce especially rigorous causal inference.

Laboratory experiments. Laboratory experiments build counterfactual worlds in well-controlled laboratory environments. Researchers randomly assign participants to the treatment or control group and then manipulate the laboratory conditions to observe different outcomes in the two groups. For example, consider laboratory experiments on team performance and gender composition 144 , 248 . The researchers randomly assign participants into groups to perform tasks such as solving puzzles or brainstorming. Teams with a higher proportion of women are found to perform better on average, offering evidence that gender diversity is causally linked to team performance. Laboratory experiments can allow researchers to test forces that are otherwise hard to observe, such as how competition influences creativity 249 . Laboratory experiments have also been used to evaluate how journal impact factors shape scientists’ perceptions of rewards 250 and gender bias in hiring 251 .

Laboratory experiments allow for precise control of settings and procedures to isolate causal effects of interest. However, participants may behave differently in synthetic environments than in real-world settings, raising questions about the generalizability and replicability of the results 252 , 253 , 254 . To assess causal effects in real-world settings, researchers use randomized controlled trials.

Randomized controlled trials. A randomized controlled trial (RCT), or field experiment, is a staple for causal inference across a wide range of disciplines. RCTs randomly assign participants into the treatment and control conditions 255 and can be used not only to assess mechanisms but also to test real-world interventions such as policy change. The science of science has witnessed growing use of RCTs. For instance, a field experiment 146 investigated whether lower search costs for collaborators increased collaboration in grant applications. The authors randomly allocated principal investigators to face-to-face sessions in a medical school, and then measured participants’ chance of writing a grant proposal together. RCTs have also offered rich causal insights on peer review 256 , 257 , 258 , 259 , 260 and gender bias in science 261 , 262 , 263 .
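
Because assignment is randomized, the analysis of such a trial can be very simple, as in the sketch below, which compares two simulated arms with a two-sample test. The proposal-writing rates, sample size and outcome definition are placeholders rather than results from the study cited above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 400  # principal investigators per arm

# Simulated field experiment: PIs randomized to a face-to-face matchmaking
# session (treatment) or not (control); the outcome is whether they later
# co-submit a grant proposal. The rates below are invented for illustration.
control = rng.binomial(1, 0.05, n)
treated = rng.binomial(1, 0.09, n)

# With randomized assignment, a simple comparison of means identifies the
# causal effect; here, a two-sample t-test on the binary outcome.
effect = treated.mean() - control.mean()
t_stat, p_value = stats.ttest_ind(treated, control)
print(f"effect = {effect:.3f}, p = {p_value:.3f}")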

While powerful, RCTs are difficult to conduct in the science of science, mainly for two reasons. The first concerns the potential risks of a policy intervention. For instance, while randomizing funding across individuals could generate crucial causal insights for funders, it may also inadvertently harm participants’ careers 264 . Second, key questions in the science of science often require a long time horizon to trace outcomes, which makes RCTs costly. It also raises the difficulty of replicating findings. A relative advantage of the quasi-experimental methods discussed earlier is that one can identify causal effects over potentially long periods of time in the historical record. On the other hand, quasi-experiments must be found rather than designed, and they are often unavailable for many questions of interest. While the best approach is context dependent, a growing community of researchers is building platforms to facilitate RCTs for the science of science, aiming to lower their costs and increase their scale. Performing RCTs in partnership with science institutions can also contribute to timely, policy-relevant research that may substantially improve science decision-making and investments.

Research in the science of science has been empowered by the growth of high-scale data, new measurement approaches and an expanding range of empirical methods. These tools provide enormous capacity to test conceptual frameworks about science, discover factors impacting scientific productivity, predict key scientific outcomes and design policies that better facilitate future scientific progress. A careful appreciation of empirical techniques can help researchers to choose effective tools for questions of interest and propel the field. A better and broader understanding of these methodologies may also build bridges across diverse research communities, facilitating communication and collaboration, and better leveraging the value of diverse perspectives. The science of science is about turning scientific methods on the nature of science itself. The fruits of this work, with time, can guide researchers and research institutions to greater progress in discovery and understanding across the landscape of scientific inquiry.

Bush, V. Science–the Endless Frontier: A Report to the President on a Program for Postwar Scientific Research (National Science Foundation, 1990).

Mokyr, J. The Gifts of Athena (Princeton Univ. Press, 2011).

Jones, B. F. in Rebuilding the Post-Pandemic Economy (eds Kearney, M. S. & Ganz, A.) 272–310 (Aspen Institute Press, 2021).

Wang, D. & Barabási, A.-L. The Science of Science (Cambridge Univ. Press, 2021).

Fortunato, S. et al. Science of science. Science 359 , eaao0185 (2018).

Azoulay, P. et al. Toward a more scientific science. Science 361 , 1194–1197 (2018).

Clauset, A., Larremore, D. B. & Sinatra, R. Data-driven predictions in the science of science. Science 355 , 477–480 (2017).

Zeng, A. et al. The science of science: from the perspective of complex systems. Phys. Rep. 714 , 1–73 (2017).

Lin, Z., Yin, Y., Liu, L. & Wang, D. SciSciNet: a large-scale open data lake for the science of science research. Sci. Data https://doi.org/10.1038/s41597-023-02198-9 (2023).

Ahmadpoor, M. & Jones, B. F. The dual frontier: patented inventions and prior scientific advance. Science 357 , 583–587 (2017).

Azoulay, P., Graff Zivin, J. S., Li, D. & Sampat, B. N. Public R&D investments and private-sector patenting: evidence from NIH funding rules. Rev. Econ. Stud. 86 , 117–152 (2019).

Yin, Y., Dong, Y., Wang, K., Wang, D. & Jones, B. F. Public use and public funding of science. Nat. Hum. Behav. 6 , 1344–1350 (2022).

Merton, R. K. The Sociology of Science: Theoretical and Empirical Investigations (Univ. Chicago Press, 1973).

Kuhn, T. The Structure of Scientific Revolutions (Princeton Univ. Press, 2021).

Uzzi, B., Mukherjee, S., Stringer, M. & Jones, B. Atypical combinations and scientific impact. Science 342 , 468–472 (2013).

Zuckerman, H. Scientific Elite: Nobel Laureates in the United States (Transaction Publishers, 1977).

Wu, L., Wang, D. & Evans, J. A. Large teams develop and small teams disrupt science and technology. Nature 566 , 378–382 (2019).

Wuchty, S., Jones, B. F. & Uzzi, B. The increasing dominance of teams in production of knowledge. Science 316 , 1036–1039 (2007).

Foster, J. G., Rzhetsky, A. & Evans, J. A. Tradition and innovation in scientists’ research strategies. Am. Sociol. Rev. 80 , 875–908 (2015).

Wang, D., Song, C. & Barabási, A.-L. Quantifying long-term scientific impact. Science 342 , 127–132 (2013).

Clauset, A., Arbesman, S. & Larremore, D. B. Systematic inequality and hierarchy in faculty hiring networks. Sci. Adv. 1 , e1400005 (2015).

Ma, A., Mondragón, R. J. & Latora, V. Anatomy of funded research in science. Proc. Natl Acad. Sci. USA 112 , 14760–14765 (2015).

Ma, Y. & Uzzi, B. Scientific prize network predicts who pushes the boundaries of science. Proc. Natl Acad. Sci. USA 115 , 12608–12615 (2018).

Azoulay, P., Graff Zivin, J. S. & Manso, G. Incentives and creativity: evidence from the academic life sciences. RAND J. Econ. 42 , 527–554 (2011).

Schor, S. & Karten, I. Statistical evaluation of medical journal manuscripts. JAMA 195 , 1123–1128 (1966).

Platt, J. R. Strong inference: certain systematic methods of scientific thinking may produce much more rapid progress than others. Science 146 , 347–353 (1964).

Ioannidis, J. P. Why most published research findings are false. PLoS Med. 2 , e124 (2005).

Simonton, D. K. Career landmarks in science: individual differences and interdisciplinary contrasts. Dev. Psychol. 27 , 119 (1991).

Way, S. F., Morgan, A. C., Clauset, A. & Larremore, D. B. The misleading narrative of the canonical faculty productivity trajectory. Proc. Natl Acad. Sci. USA 114 , E9216–E9223 (2017).

Sinatra, R., Wang, D., Deville, P., Song, C. & Barabási, A.-L. Quantifying the evolution of individual scientific impact. Science 354 , aaf5239 (2016).

Liu, L. et al. Hot streaks in artistic, cultural, and scientific careers. Nature 559 , 396–399 (2018).

Liu, L., Dehmamy, N., Chown, J., Giles, C. L. & Wang, D. Understanding the onset of hot streaks across artistic, cultural, and scientific careers. Nat. Commun. 12 , 5392 (2021).

Squazzoni, F. et al. Peer review and gender bias: a study on 145 scholarly journals. Sci. Adv. 7 , eabd0299 (2021).

Hofstra, B. et al. The diversity–innovation paradox in science. Proc. Natl Acad. Sci. USA 117 , 9284–9291 (2020).

Huang, J., Gates, A. J., Sinatra, R. & Barabási, A.-L. Historical comparison of gender inequality in scientific careers across countries and disciplines. Proc. Natl Acad. Sci. USA 117 , 4609–4616 (2020).

Gläser, J. & Laudel, G. Governing science: how science policy shapes research content. Eur. J. Sociol. 57 , 117–168 (2016).

Stephan, P. E. How Economics Shapes Science (Harvard Univ. Press, 2012).

Garfield, E. & Sher, I. H. New factors in the evaluation of scientific literature through citation indexing. Am. Doc. 14 , 195–201 (1963).

de Solla Price, D. J. Networks of scientific papers. Science 149 , 510–515 (1965).

Etzkowitz, H., Kemelgor, C. & Uzzi, B. Athena Unbound: The Advancement of Women in Science and Technology (Cambridge Univ. Press, 2000).

Simonton, D. K. Scientific Genius: A Psychology of Science (Cambridge Univ. Press, 1988).

Khabsa, M. & Giles, C. L. The number of scholarly documents on the public web. PLoS ONE 9 , e93949 (2014).

Xia, F., Wang, W., Bekele, T. M. & Liu, H. Big scholarly data: a survey. IEEE Trans. Big Data 3 , 18–35 (2017).

Evans, J. A. & Foster, J. G. Metaknowledge. Science 331 , 721–725 (2011).

Milojević, S. Quantifying the cognitive extent of science. J. Informetr. 9 , 962–973 (2015).

Rzhetsky, A., Foster, J. G., Foster, I. T. & Evans, J. A. Choosing experiments to accelerate collective discovery. Proc. Natl Acad. Sci. USA 112 , 14569–14574 (2015).

Poncela-Casasnovas, J., Gerlach, M., Aguirre, N. & Amaral, L. A. Large-scale analysis of micro-level citation patterns reveals nuanced selection criteria. Nat. Hum. Behav. 3 , 568–575 (2019).

Hardwicke, T. E. et al. Data availability, reusability, and analytic reproducibility: evaluating the impact of a mandatory open data policy at the journal Cognition. R. Soc. Open Sci. 5 , 180448 (2018).

Nagaraj, A., Shears, E. & de Vaan, M. Improving data access democratizes and diversifies science. Proc. Natl Acad. Sci. USA 117 , 23490–23498 (2020).

Bravo, G., Grimaldo, F., López-Iñesta, E., Mehmani, B. & Squazzoni, F. The effect of publishing peer review reports on referee behavior in five scholarly journals. Nat. Commun. 10 , 322 (2019).

Tran, D. et al. An open review of open review: a critical analysis of the machine learning conference review process. Preprint at https://doi.org/10.48550/arXiv.2010.05137 (2020).

Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571 , 95–98 (2019).

Yang, Y., Wu, Y. & Uzzi, B. Estimating the deep replicability of scientific findings using human and artificial intelligence. Proc. Natl Acad. Sci. USA 117 , 10762–10768 (2020).

Mukherjee, S., Uzzi, B., Jones, B. & Stringer, M. A new method for identifying recombinations of existing knowledge associated with high‐impact innovation. J. Prod. Innov. Manage. 33 , 224–236 (2016).

Leahey, E., Beckman, C. M. & Stanko, T. L. Prominent but less productive: the impact of interdisciplinarity on scientists’ research. Adm. Sci. Q. 62 , 105–139 (2017).

Sauermann, H. & Haeussler, C. Authorship and contribution disclosures. Sci. Adv. 3 , e1700404 (2017).

Oliveira, D. F. M., Ma, Y., Woodruff, T. K. & Uzzi, B. Comparison of National Institutes of Health grant amounts to first-time male and female principal investigators. JAMA 321 , 898–900 (2019).

Yang, Y., Chawla, N. V. & Uzzi, B. A network’s gender composition and communication pattern predict women’s leadership success. Proc. Natl Acad. Sci. USA 116 , 2033–2038 (2019).

Way, S. F., Larremore, D. B. & Clauset, A. Gender, productivity, and prestige in computer science faculty hiring networks. In Proc. 25th International Conference on World Wide Web 1169–1179. (ACM 2016)

Malmgren, R. D., Ottino, J. M. & Amaral, L. A. N. The role of mentorship in protege performance. Nature 465 , 622–626 (2010).

Ma, Y., Mukherjee, S. & Uzzi, B. Mentorship and protégé success in STEM fields. Proc. Natl Acad. Sci. USA 117 , 14077–14083 (2020).

Börner, K. et al. Skill discrepancies between research, education, and jobs reveal the critical need to supply soft skills for the data economy. Proc. Natl Acad. Sci. USA 115 , 12630–12637 (2018).

Biasi, B. & Ma, S. The Education-Innovation Gap (National Bureau of Economic Research Working papers, 2020).

Bornmann, L. Do altmetrics point to the broader impact of research? An overview of benefits and disadvantages of altmetrics. J. Informetr. 8 , 895–903 (2014).

Cleary, E. G., Beierlein, J. M., Khanuja, N. S., McNamee, L. M. & Ledley, F. D. Contribution of NIH funding to new drug approvals 2010–2016. Proc. Natl Acad. Sci. USA 115 , 2329–2334 (2018).

Spector, J. M., Harrison, R. S. & Fishman, M. C. Fundamental science behind today’s important medicines. Sci. Transl. Med. 10 , eaaq1787 (2018).

Haunschild, R. & Bornmann, L. How many scientific papers are mentioned in policy-related documents? An empirical investigation using Web of Science and Altmetric data. Scientometrics 110 , 1209–1216 (2017).

Yin, Y., Gao, J., Jones, B. F. & Wang, D. Coevolution of policy and science during the pandemic. Science 371 , 128–130 (2021).

Sugimoto, C. R., Work, S., Larivière, V. & Haustein, S. Scholarly use of social media and altmetrics: a review of the literature. J. Assoc. Inf. Sci. Technol. 68 , 2037–2062 (2017).

Dunham, I. Human genes: time to follow the roads less traveled? PLoS Biol. 16 , e3000034 (2018).

Kustatscher, G. et al. Understudied proteins: opportunities and challenges for functional proteomics. Nat. Methods 19 , 774–779 (2022).

Rosenthal, R. The file drawer problem and tolerance for null results. Psychol. Bull. 86 , 638 (1979).

Franco, A., Malhotra, N. & Simonovits, G. Publication bias in the social sciences: unlocking the file drawer. Science 345 , 1502–1505 (2014).

Vera-Baceta, M.-A., Thelwall, M. & Kousha, K. Web of Science and Scopus language coverage. Scientometrics 121 , 1803–1813 (2019).

Waltman, L. A review of the literature on citation impact indicators. J. Informetr. 10 , 365–391 (2016).

Garfield, E. & Merton, R. K. Citation Indexing: Its Theory and Application in Science, Technology, and Humanities (Wiley, 1979).

Kelly, B., Papanikolaou, D., Seru, A. & Taddy, M. Measuring Technological Innovation Over the Long Run Report No. 0898-2937 (National Bureau of Economic Research, 2018).

Kogan, L., Papanikolaou, D., Seru, A. & Stoffman, N. Technological innovation, resource allocation, and growth. Q. J. Econ. 132 , 665–712 (2017).

Hall, B. H., Jaffe, A. & Trajtenberg, M. Market value and patent citations. RAND J. Econ. 36 , 16–38 (2005).

Yan, E. & Ding, Y. Applying centrality measures to impact analysis: a coauthorship network analysis. J. Am. Soc. Inf. Sci. Technol. 60 , 2107–2118 (2009).

Radicchi, F., Fortunato, S., Markines, B. & Vespignani, A. Diffusion of scientific credits and the ranking of scientists. Phys. Rev. E 80 , 056103 (2009).

Bollen, J., Rodriquez, M. A. & Van de Sompel, H. Journal status. Scientometrics 69 , 669–687 (2006).

Bergstrom, C. T., West, J. D. & Wiseman, M. A. The eigenfactor™ metrics. J. Neurosci. 28 , 11433–11434 (2008).

Cronin, B. & Sugimoto, C. R. Beyond Bibliometrics: Harnessing Multidimensional Indicators of Scholarly Impact (MIT Press, 2014).

Hicks, D., Wouters, P., Waltman, L., De Rijcke, S. & Rafols, I. Bibliometrics: the Leiden Manifesto for research metrics. Nature 520 , 429–431 (2015).

Catalini, C., Lacetera, N. & Oettl, A. The incidence and role of negative citations in science. Proc. Natl Acad. Sci. USA 112 , 13823–13826 (2015).

Alcacer, J. & Gittelman, M. Patent citations as a measure of knowledge flows: the influence of examiner citations. Rev. Econ. Stat. 88 , 774–779 (2006).

Ding, Y. et al. Content‐based citation analysis: the next generation of citation analysis. J. Assoc. Inf. Sci. Technol. 65 , 1820–1833 (2014).

Teufel, S., Siddharthan, A. & Tidhar, D. Automatic classification of citation function. In Proc. 2006 Conference on Empirical Methods in Natural Language Processing, 103–110 (Association for Computational Linguistics 2006)

Seeber, M., Cattaneo, M., Meoli, M. & Malighetti, P. Self-citations as strategic response to the use of metrics for career decisions. Res. Policy 48 , 478–491 (2019).

Pendlebury, D. A. The use and misuse of journal metrics and other citation indicators. Arch. Immunol. Ther. Exp. 57 , 1–11 (2009).

Biagioli, M. Watch out for cheats in citation game. Nature 535 , 201 (2016).

Jo, W. S., Liu, L. & Wang, D. See further upon the giants: quantifying intellectual lineage in science. Quant. Sci. Stud. 3 , 319–330 (2022).

Boyack, K. W., Klavans, R. & Börner, K. Mapping the backbone of science. Scientometrics 64 , 351–374 (2005).

Gates, A. J., Ke, Q., Varol, O. & Barabási, A.-L. Nature’s reach: narrow work has broad impact. Nature 575 , 32–34 (2019).

Börner, K., Penumarthy, S., Meiss, M. & Ke, W. Mapping the diffusion of scholarly knowledge among major US research institutions. Scientometrics 68 , 415–426 (2006).

King, D. A. The scientific impact of nations. Nature 430 , 311–316 (2004).

Pan, R. K., Kaski, K. & Fortunato, S. World citation and collaboration networks: uncovering the role of geography in science. Sci. Rep. 2 , 902 (2012).

Jaffe, A. B., Trajtenberg, M. & Henderson, R. Geographic localization of knowledge spillovers as evidenced by patent citations. Q. J. Econ. 108 , 577–598 (1993).

Funk, R. J. & Owen-Smith, J. A dynamic network measure of technological change. Manage. Sci. 63 , 791–817 (2017).

Yegros-Yegros, A., Rafols, I. & D’este, P. Does interdisciplinary research lead to higher citation impact? The different effect of proximal and distal interdisciplinarity. PLoS ONE 10 , e0135095 (2015).

Larivière, V., Haustein, S. & Börner, K. Long-distance interdisciplinarity leads to higher scientific impact. PLoS ONE 10 , e0122565 (2015).

Fleming, L., Greene, H., Li, G., Marx, M. & Yao, D. Government-funded research increasingly fuels innovation. Science 364 , 1139–1141 (2019).

Bowen, A. & Casadevall, A. Increasing disparities between resource inputs and outcomes, as measured by certain health deliverables, in biomedical research. Proc. Natl Acad. Sci. USA 112 , 11335–11340 (2015).

Li, D., Azoulay, P. & Sampat, B. N. The applied value of public investments in biomedical research. Science 356 , 78–81 (2017).

Lehman, H. C. Age and Achievement (Princeton Univ. Press, 2017).

Simonton, D. K. Creative productivity: a predictive and explanatory model of career trajectories and landmarks. Psychol. Rev. 104 , 66 (1997).

Duch, J. et al. The possible role of resource requirements and academic career-choice risk on gender differences in publication rate and impact. PLoS ONE 7 , e51332 (2012).

Wang, Y., Jones, B. F. & Wang, D. Early-career setback and future career impact. Nat. Commun. 10 , 4331 (2019).

Bol, T., de Vaan, M. & van de Rijt, A. The Matthew effect in science funding. Proc. Natl Acad. Sci. USA 115 , 4887–4890 (2018).

Jones, B. F. Age and great invention. Rev. Econ. Stat. 92 , 1–14 (2010).

Newman, M. Networks (Oxford Univ. Press, 2018).

Mazloumian, A., Eom, Y.-H., Helbing, D., Lozano, S. & Fortunato, S. How citation boosts promote scientific paradigm shifts and nobel prizes. PLoS ONE 6 , e18975 (2011).

Hirsch, J. E. An index to quantify an individual’s scientific research output. Proc. Natl Acad. Sci. USA 102 , 16569–16572 (2005).

Alonso, S., Cabrerizo, F. J., Herrera-Viedma, E. & Herrera, F. h-index: a review focused in its variants, computation and standardization for different scientific fields. J. Informetr. 3 , 273–289 (2009).

Egghe, L. An improvement of the h-index: the g-index. ISSI Newsl. 2 , 8–9 (2006).

Kaur, J., Radicchi, F. & Menczer, F. Universality of scholarly impact metrics. J. Informetr. 7 , 924–932 (2013).

Majeti, D. et al. Scholar plot: design and evaluation of an information interface for faculty research performance. Front. Res. Metr. Anal. 4 , 6 (2020).

Sidiropoulos, A., Katsaros, D. & Manolopoulos, Y. Generalized Hirsch h-index for disclosing latent facts in citation networks. Scientometrics 72 , 253–280 (2007).

Jones, B. F. & Weinberg, B. A. Age dynamics in scientific creativity. Proc. Natl Acad. Sci. USA 108 , 18910–18914 (2011).

Dennis, W. Age and productivity among scientists. Science 123 , 724–725 (1956).

Sanyal, D. K., Bhowmick, P. K. & Das, P. P. A review of author name disambiguation techniques for the PubMed bibliographic database. J. Inf. Sci. 47 , 227–254 (2021).

Haak, L. L., Fenner, M., Paglione, L., Pentz, E. & Ratner, H. ORCID: a system to uniquely identify researchers. Learn. Publ. 25 , 259–264 (2012).

Malmgren, R. D., Ottino, J. M. & Amaral, L. A. N. The role of mentorship in protégé performance. Nature 465 , 662–667 (2010).

Oettl, A. Reconceptualizing stars: scientist helpfulness and peer performance. Manage. Sci. 58 , 1122–1140 (2012).

Morgan, A. C. et al. The unequal impact of parenthood in academia. Sci. Adv. 7 , eabd1996 (2021).

Morgan, A. C. et al. Socioeconomic roots of academic faculty. Nat. Hum. Behav. 6 , 1625–1633 (2022).

San Francisco Declaration on Research Assessment (DORA) (American Society for Cell Biology, 2012).

Falk‐Krzesinski, H. J. et al. Advancing the science of team science. Clin. Transl. Sci. 3 , 263–266 (2010).

Cooke, N. J. et al. Enhancing the Effectiveness of Team Science (National Academies Press, 2015).

Börner, K. et al. A multi-level systems perspective for the science of team science. Sci. Transl. Med. 2 , 49cm24 (2010).

Leahey, E. From sole investigator to team scientist: trends in the practice and study of research collaboration. Annu. Rev. Sociol. 42 , 81–100 (2016).

AlShebli, B. K., Rahwan, T. & Woon, W. L. The preeminence of ethnic diversity in scientific collaboration. Nat. Commun. 9 , 5163 (2018).

Hsiehchen, D., Espinoza, M. & Hsieh, A. Multinational teams and diseconomies of scale in collaborative research. Sci. Adv. 1 , e1500211 (2015).

Koning, R., Samila, S. & Ferguson, J.-P. Who do we invent for? Patents by women focus more on women’s health, but few women get to invent. Science 372 , 1345–1348 (2021).

Barabâsi, A.-L. et al. Evolution of the social network of scientific collaborations. Physica A 311 , 590–614 (2002).

Newman, M. E. Scientific collaboration networks. I. Network construction and fundamental results. Phys. Rev. E 64 , 016131 (2001).

Newman, M. E. Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Phys. Rev. E 64 , 016132 (2001).

Palla, G., Barabási, A.-L. & Vicsek, T. Quantifying social group evolution. Nature 446 , 664–667 (2007).

Ross, M. B. et al. Women are credited less in science than men. Nature 608 , 135–145 (2022).

Shen, H.-W. & Barabási, A.-L. Collective credit allocation in science. Proc. Natl Acad. Sci. USA 111 , 12325–12330 (2014).

Merton, R. K. Matthew effect in science. Science 159 , 56–63 (1968).

Ni, C., Smith, E., Yuan, H., Larivière, V. & Sugimoto, C. R. The gendered nature of authorship. Sci. Adv. 7 , eabe4639 (2021).

Woolley, A. W., Chabris, C. F., Pentland, A., Hashmi, N. & Malone, T. W. Evidence for a collective intelligence factor in the performance of human groups. Science 330 , 686–688 (2010).

Feldon, D. F. et al. Postdocs’ lab engagement predicts trajectories of PhD students’ skill development. Proc. Natl Acad. Sci. USA 116 , 20910–20916 (2019).

Boudreau, K. J. et al. A field experiment on search costs and the formation of scientific collaborations. Rev. Econ. Stat. 99 , 565–576 (2017).

Holcombe, A. O. Contributorship, not authorship: use CRediT to indicate who did what. Publications 7 , 48 (2019).

Murray, D. et al. Unsupervised embedding of trajectories captures the latent structure of mobility. Preprint at https://doi.org/10.48550/arXiv.2012.02785 (2020).

Deville, P. et al. Career on the move: geography, stratification, and scientific impact. Sci. Rep. 4 , 4770 (2014).

Edmunds, L. D. et al. Why do women choose or reject careers in academic medicine? A narrative review of empirical evidence. Lancet 388 , 2948–2958 (2016).

Waldinger, F. Peer effects in science: evidence from the dismissal of scientists in Nazi Germany. Rev. Econ. Stud. 79 , 838–861 (2012).

Agrawal, A., McHale, J. & Oettl, A. How stars matter: recruiting and peer effects in evolutionary biology. Res. Policy 46 , 853–867 (2017).

Fiore, S. M. Interdisciplinarity as teamwork: how the science of teams can inform team science. Small Group Res. 39 , 251–277 (2008).

Hvide, H. K. & Jones, B. F. University innovation and the professor’s privilege. Am. Econ. Rev. 108 , 1860–1898 (2018).

Murray, F., Aghion, P., Dewatripont, M., Kolev, J. & Stern, S. Of mice and academics: examining the effect of openness on innovation. Am. Econ. J. Econ. Policy 8 , 212–252 (2016).

Radicchi, F., Fortunato, S. & Castellano, C. Universality of citation distributions: toward an objective measure of scientific impact. Proc. Natl Acad. Sci. USA 105 , 17268–17272 (2008).

Waltman, L., van Eck, N. J. & van Raan, A. F. Universality of citation distributions revisited. J. Am. Soc. Inf. Sci. Technol. 63 , 72–77 (2012).

Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. Science 286 , 509–512 (1999).

de Solla Price, D. A general theory of bibliometric and other cumulative advantage processes. J. Am. Soc. Inf. Sci. 27 , 292–306 (1976).

Cole, S. Age and scientific performance. Am. J. Sociol. 84 , 958–977 (1979).

Ke, Q., Ferrara, E., Radicchi, F. & Flammini, A. Defining and identifying sleeping beauties in science. Proc. Natl Acad. Sci. USA 112 , 7426–7431 (2015).

Bornmann, L., de Moya Anegón, F. & Leydesdorff, L. Do scientific advancements lean on the shoulders of giants? A bibliometric investigation of the Ortega hypothesis. PLoS ONE 5 , e13327 (2010).

Mukherjee, S., Romero, D. M., Jones, B. & Uzzi, B. The nearly universal link between the age of past knowledge and tomorrow’s breakthroughs in science and technology: the hotspot. Sci. Adv. 3 , e1601315 (2017).

Packalen, M. & Bhattacharya, J. NIH funding and the pursuit of edge science. Proc. Natl Acad. Sci. USA 117 , 12011–12016 (2020).

Zeng, A., Fan, Y., Di, Z., Wang, Y. & Havlin, S. Fresh teams are associated with original and multidisciplinary research. Nat. Hum. Behav. 5 , 1314–1322 (2021).

Newman, M. E. The structure of scientific collaboration networks. Proc. Natl Acad. Sci. USA 98 , 404–409 (2001).

Larivière, V., Ni, C., Gingras, Y., Cronin, B. & Sugimoto, C. R. Bibliometrics: global gender disparities in science. Nature 504 , 211–213 (2013).

West, J. D., Jacquet, J., King, M. M., Correll, S. J. & Bergstrom, C. T. The role of gender in scholarly authorship. PLoS ONE 8 , e66212 (2013).

Gao, J., Yin, Y., Myers, K. R., Lakhani, K. R. & Wang, D. Potentially long-lasting effects of the pandemic on scientists. Nat. Commun. 12 , 6188 (2021).

Jones, B. F., Wuchty, S. & Uzzi, B. Multi-university research teams: shifting impact, geography, and stratification in science. Science 322 , 1259–1262 (2008).

Chu, J. S. & Evans, J. A. Slowed canonical progress in large fields of science. Proc. Natl Acad. Sci. USA 118 , e2021636118 (2021).

Wang, J., Veugelers, R. & Stephan, P. Bias against novelty in science: a cautionary tale for users of bibliometric indicators. Res. Policy 46 , 1416–1436 (2017).

Stringer, M. J., Sales-Pardo, M. & Amaral, L. A. Statistical validation of a global model for the distribution of the ultimate number of citations accrued by papers published in a scientific journal. J. Assoc. Inf. Sci. Technol. 61 , 1377–1385 (2010).

Bianconi, G. & Barabási, A.-L. Bose-Einstein condensation in complex networks. Phys. Rev. Lett. 86 , 5632 (2001).

Bianconi, G. & Barabási, A.-L. Competition and multiscaling in evolving networks. Europhys. Lett. 54 , 436 (2001).

Yin, Y. & Wang, D. The time dimension of science: connecting the past to the future. J. Informetr. 11 , 608–621 (2017).

Pan, R. K., Petersen, A. M., Pammolli, F. & Fortunato, S. The memory of science: Inflation, myopia, and the knowledge network. J. Informetr. 12 , 656–678 (2018).

Yin, Y., Wang, Y., Evans, J. A. & Wang, D. Quantifying the dynamics of failure across science, startups and security. Nature 575 , 190–194 (2019).

Candia, C. & Uzzi, B. Quantifying the selective forgetting and integration of ideas in science and technology. Am. Psychol. 76 , 1067 (2021).

Milojević, S. Principles of scientific research team formation and evolution. Proc. Natl Acad. Sci. USA 111 , 3984–3989 (2014).

Guimera, R., Uzzi, B., Spiro, J. & Amaral, L. A. N. Team assembly mechanisms determine collaboration network structure and team performance. Science 308 , 697–702 (2005).

Newman, M. E. Coauthorship networks and patterns of scientific collaboration. Proc. Natl Acad. Sci. USA 101 , 5200–5205 (2004).

Newman, M. E. Clustering and preferential attachment in growing networks. Phys. Rev. E 64 , 025102 (2001).

Iacopini, I., Milojević, S. & Latora, V. Network dynamics of innovation processes. Phys. Rev. Lett. 120 , 048301 (2018).

Kuhn, T., Perc, M. & Helbing, D. Inheritance patterns in citation networks reveal scientific memes. Phys. Rev. X 4 , 041036 (2014).

Jia, T., Wang, D. & Szymanski, B. K. Quantifying patterns of research-interest evolution. Nat. Hum. Behav. 1 , 0078 (2017).

Zeng, A. et al. Increasing trend of scientists to switch between topics. Nat. Commun. https://doi.org/10.1038/s41467-019-11401-8 (2019).

Siudem, G., Żogała-Siudem, B., Cena, A. & Gagolewski, M. Three dimensions of scientific impact. Proc. Natl Acad. Sci. USA 117 , 13896–13900 (2020).

Petersen, A. M. et al. Reputation and impact in academic careers. Proc. Natl Acad. Sci. USA 111 , 15316–15321 (2014).

Jin, C., Song, C., Bjelland, J., Canright, G. & Wang, D. Emergence of scaling in complex substitutive systems. Nat. Hum. Behav. 3 , 837–846 (2019).

Hofman, J. M. et al. Integrating explanation and prediction in computational social science. Nature 595 , 181–188 (2021).

Lazer, D. et al. Computational social science. Science 323 , 721–723 (2009).

Lazer, D. M. et al. Computational social science: obstacles and opportunities. Science 369 , 1060–1062 (2020).

Albert, R. & Barabási, A.-L. Statistical mechanics of complex networks. Rev. Mod. Phys. 74 , 47 (2002).

Newman, M. E. The structure and function of complex networks. SIAM Rev. 45 , 167–256 (2003).

Song, C., Qu, Z., Blumm, N. & Barabási, A.-L. Limits of predictability in human mobility. Science 327 , 1018–1021 (2010).

Alessandretti, L., Aslak, U. & Lehmann, S. The scales of human mobility. Nature 587 , 402–407 (2020).

Pastor-Satorras, R. & Vespignani, A. Epidemic spreading in scale-free networks. Phys. Rev. Lett. 86 , 3200 (2001).

Pastor-Satorras, R., Castellano, C., Van Mieghem, P. & Vespignani, A. Epidemic processes in complex networks. Rev. Mod. Phys. 87 , 925 (2015).

Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

Bishop, C. M. Pattern Recognition and Machine Learning (Springer, 2006).

Dong, Y., Johnson, R. A. & Chawla, N. V. Will this paper increase your h-index? Scientific impact prediction. In Proc. 8th ACM International Conference on Web Search and Data Mining, 149–158 (ACM 2015)

Xiao, S. et al. On modeling and predicting individual paper citation count over time. In IJCAI, 2676–2682 (IJCAI, 2016)

Fortunato, S. Community detection in graphs. Phys. Rep. 486 , 75–174 (2010).

Chen, C. Science mapping: a systematic review of the literature. J. Data Inf. Sci. 2 , 1–40 (2017).

Van Eck, N. J. & Waltman, L. Citation-based clustering of publications using CitNetExplorer and VOSviewer. Scientometrics 111 , 1053–1070 (2017).

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).

Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577 , 706–710 (2020).

Krenn, M. & Zeilinger, A. Predicting research trends with semantic and neural networks with an application in quantum physics. Proc. Natl Acad. Sci. USA 117 , 1910–1916 (2020).

Iten, R., Metger, T., Wilming, H., Del Rio, L. & Renner, R. Discovering physical concepts with neural networks. Phys. Rev. Lett. 124 , 010508 (2020).

Guimerà, R. et al. A Bayesian machine scientist to aid in the solution of challenging scientific problems. Sci. Adv. 6 , eaav6971 (2020).

Segler, M. H., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555 , 604–610 (2018).

Ryu, J. Y., Kim, H. U. & Lee, S. Y. Deep learning improves prediction of drug–drug and drug–food interactions. Proc. Natl Acad. Sci. USA 115 , E4304–E4311 (2018).

Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 , 1122–1131.e9 (2018).

Peng, H., Ke, Q., Budak, C., Romero, D. M. & Ahn, Y.-Y. Neural embeddings of scholarly periodicals reveal complex disciplinary organizations. Sci. Adv. 7 , eabb9004 (2021).

Youyou, W., Yang, Y. & Uzzi, B. A discipline-wide investigation of the replicability of psychology papers over the past two decades. Proc. Natl Acad. Sci. USA 120 , e2208863120 (2023).

Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias and fairness in machine learning. ACM Computing Surveys (CSUR) 54 , 1–35 (2021).

Way, S. F., Morgan, A. C., Larremore, D. B. & Clauset, A. Productivity, prominence, and the effects of academic environment. Proc. Natl Acad. Sci. USA 116 , 10729–10733 (2019).

Li, W., Aste, T., Caccioli, F. & Livan, G. Early coauthorship with top scientists predicts success in academic careers. Nat. Commun. 10 , 5170 (2019).

Hendry, D. F., Pagan, A. R. & Sargan, J. D. Dynamic specification. Handb. Econ. 2 , 1023–1100 (1984).

Jin, C., Ma, Y. & Uzzi, B. Scientific prizes and the extraordinary growth of scientific topics. Nat. Commun. 12 , 5619 (2021).

Azoulay, P., Ganguli, I. & Zivin, J. G. The mobility of elite life scientists: professional and personal determinants. Res. Policy 46 , 573–590 (2017).

Slavova, K., Fosfuri, A. & De Castro, J. O. Learning by hiring: the effects of scientists’ inbound mobility on research performance in academia. Organ. Sci. 27 , 72–89 (2016).

Sarsons, H. Recognition for group work: gender differences in academia. Am. Econ. Rev. 107 , 141–145 (2017).

Campbell, L. G., Mehtani, S., Dozier, M. E. & Rinehart, J. Gender-heterogeneous working groups produce higher quality science. PLoS ONE 8 , e79147 (2013).

Azoulay, P., Graff Zivin, J. S. & Wang, J. Superstar extinction. Q. J. Econ. 125 , 549–589 (2010).

Furman, J. L. & Stern, S. Climbing atop the shoulders of giants: the impact of institutions on cumulative research. Am. Econ. Rev. 101 , 1933–1963 (2011).

Williams, H. L. Intellectual property rights and innovation: evidence from the human genome. J. Polit. Econ. 121 , 1–27 (2013).

Rubin, A. & Rubin, E. Systematic bias in the progress of research. J. Polit. Econ. 129 , 2666–2719 (2021).

Lu, S. F., Jin, G. Z., Uzzi, B. & Jones, B. The retraction penalty: evidence from the Web of Science. Sci. Rep. 3 , 3146 (2013).

Jin, G. Z., Jones, B., Lu, S. F. & Uzzi, B. The reverse Matthew effect: consequences of retraction in scientific teams. Rev. Econ. Stat. 101 , 492–506 (2019).

Azoulay, P., Bonatti, A. & Krieger, J. L. The career effects of scandal: evidence from scientific retractions. Res. Policy 46 , 1552–1569 (2017).

Goodman-Bacon, A. Difference-in-differences with variation in treatment timing. J. Econ. 225 , 254–277 (2021).

Callaway, B. & Sant’Anna, P. H. Difference-in-differences with multiple time periods. J. Econ. 225 , 200–230 (2021).

Hill, R. Searching for Superstars: Research Risk and Talent Discovery in Astronomy Working Paper (Massachusetts Institute of Technology, 2019).

Bagues, M., Sylos-Labini, M. & Zinovyeva, N. Does the gender composition of scientific committees matter? Am. Econ. Rev. 107 , 1207–1238 (2017).

Sampat, B. & Williams, H. L. How do patents affect follow-on innovation? Evidence from the human genome. Am. Econ. Rev. 109 , 203–236 (2019).

Moretti, E. & Wilson, D. J. The effect of state taxes on the geographical location of top earners: evidence from star scientists. Am. Econ. Rev. 107 , 1858–1903 (2017).

Jacob, B. A. & Lefgren, L. The impact of research grant funding on scientific productivity. J. Public Econ. 95 , 1168–1177 (2011).

Li, D. Expertise versus bias in evaluation: evidence from the NIH. Am. Econ. J. Appl. Econ. 9 , 60–92 (2017).

Pearl, J. Causal diagrams for empirical research. Biometrika 82 , 669–688 (1995).

Pearl, J. & Mackenzie, D. The Book of Why: The New Science of Cause and Effect (Basic Books, 2018).

Traag, V. A. Inferring the causal effect of journals on citations. Quant. Sci. Stud. 2 , 496–504 (2021).

Traag, V. & Waltman, L. Causal foundations of bias, disparity and fairness. Preprint at https://doi.org/10.48550/arXiv.2207.13665 (2022).

Imbens, G. W. Potential outcome and directed acyclic graph approaches to causality: relevance for empirical practice in economics. J. Econ. Lit. 58 , 1129–1179 (2020).

Heckman, J. J. & Pinto, R. Causality and Econometrics (National Bureau of Economic Research, 2022).

Aggarwal, I., Woolley, A. W., Chabris, C. F. & Malone, T. W. The impact of cognitive style diversity on implicit learning in teams. Front. Psychol. 10 , 112 (2019).

Balietti, S., Goldstone, R. L. & Helbing, D. Peer review and competition in the Art Exhibition Game. Proc. Natl Acad. Sci. USA 113 , 8414–8419 (2016).

Paulus, F. M., Rademacher, L., Schäfer, T. A. J., Müller-Pinzler, L. & Krach, S. Journal impact factor shapes scientists’ reward signal in the prospect of publication. PLoS ONE 10 , e0142537 (2015).

Williams, W. M. & Ceci, S. J. National hiring experiments reveal 2:1 faculty preference for women on STEM tenure track. Proc. Natl Acad. Sci. USA 112 , 5360–5365 (2015).

Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349 , aac4716 (2015).

Camerer, C. F. et al. Evaluating replicability of laboratory experiments in economics. Science 351 , 1433–1436 (2016).

Camerer, C. F. et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat. Hum. Behav. 2 , 637–644 (2018).

Duflo, E. & Banerjee, A. Handbook of Field Experiments (Elsevier, 2017).

Tomkins, A., Zhang, M. & Heavlin, W. D. Reviewer bias in single versus double-blind peer review. Proc. Natl Acad. Sci. USA 114 , 12708–12713 (2017).

Blank, R. M. The effects of double-blind versus single-blind reviewing: experimental evidence from the American Economic Review. Am. Econ. Rev. 81 , 1041–1067 (1991).

Boudreau, K. J., Guinan, E. C., Lakhani, K. R. & Riedl, C. Looking across and looking beyond the knowledge frontier: intellectual distance, novelty, and resource allocation in science. Manage. Sci. 62 , 2765–2783 (2016).

Lane, J. et al. When Do Experts Listen to Other Experts? The Role of Negative Information in Expert Evaluations for Novel Projects Working Paper #21-007 (Harvard Business School, 2020).

Teplitskiy, M. et al. Do Experts Listen to Other Experts? Field Experimental Evidence from Scientific Peer Review (Harvard Business School, 2019).

Moss-Racusin, C. A., Dovidio, J. F., Brescoll, V. L., Graham, M. J. & Handelsman, J. Science faculty’s subtle gender biases favor male students. Proc. Natl Acad. Sci. USA 109 , 16474–16479 (2012).

Forscher, P. S., Cox, W. T., Brauer, M. & Devine, P. G. Little race or gender bias in an experiment of initial review of NIH R01 grant proposals. Nat. Hum. Behav. 3 , 257–264 (2019).

Dennehy, T. C. & Dasgupta, N. Female peer mentors early in college increase women’s positive academic experiences and retention in engineering. Proc. Natl Acad. Sci. USA 114 , 5964–5969 (2017).

Azoulay, P. Turn the scientific method on ourselves. Nature 484 , 31–32 (2012).

Acknowledgements

The authors thank all members of the Center for Science of Science and Innovation (CSSI) for invaluable comments. This work was supported by the Air Force Office of Scientific Research under award number FA9550-19-1-0354, National Science Foundation grant SBE 1829344, and the Alfred P. Sloan Foundation G-2019-12485.

Author information

Authors and affiliations

Center for Science of Science and Innovation, Northwestern University, Evanston, IL, USA

Lu Liu, Benjamin F. Jones, Brian Uzzi & Dashun Wang

Northwestern Institute on Complex Systems, Northwestern University, Evanston, IL, USA

Kellogg School of Management, Northwestern University, Evanston, IL, USA

College of Information Sciences and Technology, Pennsylvania State University, University Park, PA, USA

National Bureau of Economic Research, Cambridge, MA, USA

Benjamin F. Jones

Brookings Institution, Washington, DC, USA

McCormick School of Engineering, Northwestern University, Evanston, IL, USA

Dashun Wang

Corresponding author

Correspondence to Dashun Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Human Behaviour thanks Ludo Waltman, Erin Leahey and Sarah Bratt for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, L., Jones, B.F., Uzzi, B. et al. Data, measurement and empirical methods in the science of science. Nat Hum Behav 7 , 1046–1058 (2023). https://doi.org/10.1038/s41562-023-01562-4

Received : 30 June 2022

Accepted : 17 February 2023

Published : 01 June 2023

Issue Date : July 2023

DOI : https://doi.org/10.1038/s41562-023-01562-4

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

What, what for and how? Developing measurement instruments in epidemiology

Michael Reichenheim

I Universidade do Estado do Rio de Janeiro, Instituto de Medicina Social Hésio Cordeiro, Departamento de Epidemiologia, Rio de Janeiro, RJ, Brasil

João Luiz Bastos

II Universidade Federal de Santa Catarina, Departamento de Saúde Pública, Florianópolis, SC, Brasil

Authors’ Contribution: Conception and planning of the study: MER, JLB. Writing of the manuscript: MER, JLB. Final approval: MER, JLB.

The development and cross-cultural adaptation of measurement instruments have received less attention in methodological discussions, even though they are essential for epidemiological research. At the same time, the quality of epidemiological measurements is often below the ideal standard for building solid knowledge on the health-disease process. The scarcity of systematizations in the field about what, what for, and how to adequately measure intangible constructs contributes to this scenario. In this review, we propose a procedural model divided into phases and stages aimed at measuring constructs at acceptable levels of validity, reliability, and comparability. Underlying our proposal is the idea that not just a few but several connected and successively deeper studies should be conducted to obtain appropriate measurement instruments. Implementing the model may help broaden interest in measurement instruments and, especially, address key epidemiological problems.

INTRODUCTION

Considered one of the pillars of public health, epidemiology is chiefly concerned with the frequency, distribution, and determinants or causes of health events in human populations 1 . By emphasizing these aspects, the measurement of related events — either dimensions of the health-disease process or factors that are causally related to it — is key in the development of research in the field. Epidemiologists employ considerable efforts to measure specific health-disease conditions, assess characteristics (of person, place, and time) that allow establishing comparisons and assessing variability, as well as address the processes underlying their occurrence in a given population domain 2 . Although there are exceptions, the epidemiological measurement of these processes and factors is predominantly quantitative, which allows the subsequent statistical analysis of their patterns of association in order to assess the health event and intervene upon it.

The measurement process is not a trivial activity. Rather, it is considerably complex and imposes important challenges. It demands substantial conceptual rigor, in addition to the other issues discussed in greater detail later in this article 6 , 7 . It is impossible to measure, within acceptable levels of validity and reliability, a phenomenon of epidemiological interest whose definition is ambiguous, whether among researchers or in the population whose health-disease conditions are the object of study. The use of instruments with good psychometric properties is equally important for measuring the aspects of interest in a given population 6 . In their absence, not only do the validity and reliability of the measurements become questionable, but it also becomes harder to compare data across studies of the same health event 8 , limiting the construction of solid scientific knowledge on the research object. Such knowledge is typically established through the systematic accumulation and contrasting of research results, which must therefore be amenable to comparison.

Although essential to any epidemiological research, measurement has received less emphasis in the methodological discussions that pervade the field. While issues related to study design, potential biases, and statistical techniques often guide epidemiology courses and debates, relatively little space has been devoted to the rigor and processes involved in measurement. In this scenario, there is a clear need for a comprehensive treatment, spanning the stages of theoretical construction as well as the formal psychometric tests employed when developing or adapting measurement instruments. The authors of this paper were unable to find a discussion of the differences between what, what for, and how a measurement instrument should be developed, including the evaluation of its internal and external structures. The aim of this review is, therefore, to offer a set of guiding principles on possible paths to be followed for the development or cross-cultural adaptation of measurement instruments used in epidemiological studies. By proposing a procedural model composed of sequential phases and stages, we expect this study to contribute to improving the quality of knowledge production in the health field. We also hope that it will improve academic training in epidemiology: encouraging engagement with the measurement literature can help students and researchers develop the skills and competencies needed to put the proposal into practice.

Our approach is indicative rather than exhaustive, as the literature on the subject is vast and complex. We chose to focus on a few points with immediate relevance and applicability to epidemiological practice, relying on widely recommended bibliographic references; some specific publications are also cited as suggestions to guide particular procedures or decisions. We hope that this introduction will encourage broader reading on the topics covered.

RESEARCH SCENARIOS AND INSTRUMENT DEVELOPMENT OR ADAPTATION

Epidemiological studies require well-defined and socially relevant research questions, which, in turn, demand reliable and accurate measurements of the phenomena and concepts needed to answer them 8 . Berry et al. 9 discuss three perspectives that are particularly relevant for the issues at hand.

From the absolutist perspective, sociocultural nuances are disregarded in the interpretation of health-related events, thus assuming the possibility of unrestricted comparisons between quantitative measurements carried out across populations. In this case, a single measurement instrument could be widely employed in different populations, and results could be directly compared to consolidate scientific knowledge about the object of interest.

The relativist approach lies in a diametrically opposite position. Accordingly, sociocultural specificities are placed at the forefront, so that a different measurement instrument should be used for each population. This approach denies the possibility of quantitatively comparing measurements taken in socioculturally differentiated populations, since instruments would not be equivalent to each other, and the only way to contrast them would be through qualitative analyses.

The universalist perspective assumes an intermediate position, implying both the quantitative measurement of investigated phenomena and the possibility of comparisons between populations. This stance recognizes sociocultural nuances and the need to acknowledge them. If there is similarity in the way events are interpreted among different populations, it would be possible to pursue a so-called “universal” instrument, albeit adapted to each particular situation. According to this view, cross-cultural adaptation would ensure equivalence across different versions of the same instrument 10 . Its application would allow socio-culturally distinct populations to be quantitatively compared, based on equivalent measures of the same problem of interest.

The universalist approach 3 , 6 , 11 , 12 implies three possible scenarios, which must be evaluated and identified by the investigator when selecting the research instrument to be used in the study:

  • there is an established and adapted instrument for use in different populations, including the population of interest (Scenario 1);
  • an instrument is available, but it requires additional refinement, given its limited applicability to the population of interest, either because it requires complementary psychometric assessments, or because it still needs to be cross-culturally adapted (Scenario 2); or
  • no instrument is available and it is necessary to develop an entirely new one (Scenario 3).

In Scenario 2, cross-cultural adaptation studies are often needed, and in these the concept of equivalence is a guiding principle 10 . Equivalence is usually broken down into conceptual, item, semantic, operational, and measurement equivalence: all need to be assessed before an instrument can be considered fully adapted. In Scenario 3, the researcher should put the original research initiative on hold and develop an entirely new instrument 13 . In this case, it is necessary to undertake a parallel research program to develop an instrument capable of measuring the parameters of interest. This is crucial, since conducting research without good measurement instruments puts the whole project at risk; it decreases the chances of contributing to the advancement of knowledge or of meeting a particular health need and thus becomes ethically questionable. Most of the time, epidemiological studies fall within Scenarios 2 and 3, the former being the most common in the Brazilian research context.

An important implication of working within these scenarios is the need to know, in detail, the state of the art of the available instruments. Such knowledge is essential for adapting, refining, or developing measurement tools.

PHASES OF DEVELOPMENT OR ADAPTATION OF INSTRUMENTS

Different procedural stages need to be followed whether the objective is to develop a new instrument or cross-culturally adapt an existing one. The Figure shows one proposed procedural model. The first phase elaborates and details the construct to be measured, which involves several steps: specifying, preparing, and refining the items regarding their empirical and semantic contents; detailing operational aspects, inter alia, the scenarios under which the instrument will be administered; and implementing several pre-tests to refine some aspects, such as item wording and their understanding by the target population. Provisionally called “prototypical” because it involves assembling one or more sketches of the instrument (i.e., prototypes or preliminary versions) to be subsequently tested, this first phase of the process is essential for achieving good results. This step is as essential in the development of a new instrument as in cross-cultural adaptations, in which the notion of equivalence (referred to in the previous section) requires thorough examination. This must be emphasized, since the efforts dedicated to this stage are often scarce in adaptation processes—if not completely ignored.

[Figure: proposed procedural model for the development or cross-cultural adaptation of a measurement instrument, comprising a first, prototypic phase and a second, psychometric phase.]

This first phase is not only essential from a substantive point of view, in seeking correspondence between the construct to be measured and the tool for measuring it, but it also makes the next phase, the testing of prototypes, more efficient. Allocating enough time to this stage and proceeding with rigor reduce the likelihood of finding problems in subsequent validation studies, which are generally large and therefore more expensive. The worst-case scenario is to find pronounced deficiencies at the end of a long and intricate process involving multiple interconnected studies and to have to go back to the field to test an almost entirely new prototype.

The prototype specified in the previous phase is then examined in an extensive second phase, which we call "psychometric." Unlike the first phase, in which qualitative approaches are more prominent, this second phase, as already suggested, comprises a sequence of larger quantitative studies. The lower portion of the Figure expands its upper right part and shows the various psychometric aspects of this phase. Two distinct segments should be noted: one concerns the internal structure of the instrument, covering its configural, metric, and scalar structures; the other addresses its external validity, assessing, for example, whether patterns of association between estimates derived from the instrument and measures of other constructs agree with what is theoretically expected.

Before detailing the proposed procedural model, it is worth mentioning the types of instruments to which the Figure refers. As should be clear throughout the text, the model we propose involves constructs (dimensions) in which the object of study intensifies or recedes according to a particular gradient. Although these types of constructs are common—e.g., diseases and injuries, depression, psychosocial events (violence), perceptions of health or quality of life—in some cases this increasing or decreasing severity or intensity is not applicable or does not matter much. A good example is what we might call “inventories”: a questionnaire to investigate whether an individual was ever exposed to a given chemical agent. Here, the instrument should contain a wide range of questions about potential contact situations over a period, with just one endorsement (hit) required to confirm the respondent’s exposure. Although one can think of a second instrument to capture the degree of exposure to this chemical agent—measuring the increasing intensity of this exposure—such an inventory would not focus on an underlying gradient. Another situation in which the model in the Figure would not be applicable refers to pragmatic instruments based on a set of risk predictors that are not theoretically linked to an underlying construct. An example would be a tool to predict the risk of dying from Covid-19 at the first contact of a patient with the health service, composed of variables covering several aspects, such as sociodemographic characteristics, health-related behaviors, pre-existing conditions, recent contacts with Covid-19 cases, or even admission exams. Though extremely important, this set of items would still not constitute a construct to be mapped.

In many other situations, the items of an instrument do not connect with a construct and/or do not form an explicit gradient of intensity. It is therefore up to the researcher to assess them and to evaluate whether a procedural model such as the one proposed here applies to the problem at hand. The following three sections provide some details about the two phases. It is worth pointing out that there are numerous paths that could be followed and that our choice is only one of many. We suggest that interested readers consult the related bibliography, part of which is discussed in the following sections.

PROPOSING AN INSTRUMENT

The details of the first phase of the procedural model illustrated in the Figure can be found in Box 1 . Adapted from Wilson's 13 proposal, the process contains five distinct stages. In the first, the theory supporting the construct is evaluated as to the extent to which it represents what one wants to measure. This representation is technically called the "construct map," which outlines the ideas the developers (or adapters) of the instrument have about what is to be captured, including its gradient of intensity 13 . The construct map guides the process of developing items that will reflect the construct in question. The aim is to arrive at an efficient and effective set of items with good measurement properties: items that, in the most discriminating and orderly way possible, map the metric space of the construct. The empirical expression of the construct map is sometimes called the Wright map, which consists of the selected items positioned along the expected increasing gradient of intensity 13 .

Box 1. Stages of the first ("prototypic") phase of instrument development or adaptation.

Stage 1. Evaluation of the theory upon which the construct is based.
Description and purpose: Theoretical appreciation of the construct one wishes to assess, with respect to both potential multidimensionality and a gradient of intensity in each dimension. This stage develops the (dimensional) construct map.
Questions to be answered: What is the definition of the construct of interest? Are there postulated subdimensions for the construct? Which are they? What would be the theoretical elements of the dimension(s), and how would they be organized in a gradient of intensity?
Technique/method/model: Literature review; consultation with experts.
Empirical expression: None, since the definition of the construct, its gradient of intensity, and its possible subdimensions are fundamentally theoretical questions.

Stage 2. Item content evaluation.
Description and purpose: Identification of the empirical manifestations of the dimension(s) and how they cover parts of the construct map. In this stage, a preliminary content validity (a.k.a. face validity) is proposed, connecting the empirical expression of the item content to the underlying theoretical elements.
Questions to be answered: Do the items have contents tied to the underlying dimension? Are the items distinct from each other in terms of content? Is each part of the construct map represented by a specific item? Do the items cover the construct map sufficiently and adequately (i.e., without gaps and without occupying positions similar to other items)?
Technique/method/model: Literature review; consultation with experts; qualitative approaches with members of the target population (in-depth interviews, focus groups, etc.).
Empirical expression: Individually, each item reflects a specific part of the construct map. Together, the items should sufficiently and adequately cover the contents of the underlying construct (or, if it is multidimensional, each constituent dimension).

Stage 3. Item semantics specification.
Description and purpose: Writing items so as to better convey their content to the respondent.
Questions to be answered: Do the terms used in writing the items allow a direct and unambiguous connection to specific parts of the construct map?
Technique/method/model: Consultation with linguistics experts and experts on the subject matter, as well as translators (in the case of adaptations).
Empirical expression: The items of the instrument and their specific wording.

Stage 4. Evaluation of operational aspects.
Description and purpose: Assess and decide how the instrument is to be administered (face-to-face interviews, self-completed forms, computer-assisted questionnaires, etc.), which includes assessing the adequacy of the operational scenario. In this stage, an evaluation of the contribution of each item to the construct map begins, including consideration of the levels/categories of the outcome.
Questions to be answered: What is the most appropriate mode of administration, considering the target population? In what operational scenario should the instrument be administered?
Technique/method/model: Consultation with experts and with members of the target population via qualitative studies.
Empirical expression: Mode(s) of application of the instrument in the desired operational scenario. Any instrument should be evaluated in light of a pre-established operational scenario, preferably early in its development (or adaptation) process.

Stage 5. Pre-tests (including preliminary reliability tests).
Description and purpose: Medium-sized studies (e.g., n = 100–150) aimed at evaluating acceptance, understanding, and emotional impact of the items; formal aspects related to the sequence of items or rules for skipping them; instrument response options; and the operational scenario (operational aspects). This stage can also be used for preliminary reliability analyses, focusing on internal consistency, inter- and intra-observer agreement/test-retest, etc.
Questions to be answered: Does the instrument have an acceptable degree of understanding? Are the reactions the items arouse in respondents within what was expected? Does the sequence of items contribute to easy administration for interviewers and/or respondents? Are the response options in line with respondents' ability to discern them? Does the operational scenario favor the interaction between instrument and respondent, or interviewer and respondent? Are there indications of good reliability in preliminary studies (pre-tests)?
Technique/method/model: Administration of the instrument in the target population, possibly including alternative formulations of the items. A sequence of studies should be carried out until one or more prototypes are obtained for the second phase of instrument development (or adaptation).
Empirical expression: Records of the administration of the instrument in the target population; reliability indicators (acceptability differs by type). See Reichenheim et al. for more details; see also Streiner et al., Nunnally and Bernstein, Raykov and Marcoulides, Price, and Shavelson and Webb.

Note: References: Streiner et al. 7 , Beatty et al. 42 , Moser and Kalton 43 , Bastos et al. 3 , Reichenheim and Moraes 6 , Johnson and Morgan 44 , DeVellis 45 ; Gorenstein et al. 46 Some of these references are occasionally marked, when necessary, along with other specific ones.

Moving from the theoretical-conceptual dimension of the construct map to the empirical dimension of the items requires contextualization and, thus, a good grasp of the population to which the instrument will be administered. On the one hand, the construct (and what it represents within the underlying theory) should be pertinent to the population in question. On the other hand, eligible items must have the potential to be endorsed in the desired context. It is always necessary to ask whether an item can realistically be endorsed or whether negative answers stem from an intrinsic impossibility. An example is an item on explicit discrimination experienced in the workplace asked of schoolchildren who have not yet reached working age. Although somewhat obvious once pointed out, this is a common problem that requires careful consideration.

Once the construct map is specified, it is used to identify and develop the items that will be part of the instrument. At this stage, researchers should identify the various ways in which the construct manifests itself, including its different levels of intensity 13 . Box 1 distinguishes the process of identifying items from how these will be conveyed to respondents. These are, indeed, different tasks. The process of identifying potential items derives directly from the construct map and has to do with recognizing the empirical manifestations representing the outlined gradient of intensity; it concerns the content (meaning) of an item, not its form (wording). Syntactic and semantic questions come in later (third stage), when the number of candidate items has been further restricted through sequential qualitative studies 3 , 6 .

The fourth stage of the first phase concerns operational issues, starting with the specification of the outcome space of each item. Identifying the type and number of response categories that items should contain is an important task. Like other eminently operational issues (instrument format, administration scenario, etc.), debating and specifying the types of answers should be done early on, as soon as the target population of the instrument is identified. The third stage is then resumed with this focus: writing the descriptors of the response categories that were previously outlined and defined.

At this point, it is worth emphasizing that the validity of an instrument—its adequacy and performance—is dependent upon a close connection with the background content, attention to respondents’ cognitive and emotional capacity, and a productive environment in which answers can be provided with ethics, spontaneity, and safety. One should keep in mind that even a validated instrument can still underperform if administered to a population for which it was not originally developed or in an adverse operative context.

Item design and outcome specification require a first visit to the target population so that the first batches of prototypes (i.e., alternative and preliminary versions of the instrument) are assessed regarding acceptability, understanding, and emotional impact. A good strategy is to pre-test the instrument (fifth stage). Based on evidence from the pre-test, the most promising prototypes are then put to test in the next phase. Box 1 provides additional information and suggests several references for consultation.
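
Box 1's fifth stage mentions preliminary reliability checks such as internal consistency. As a purely illustrative sketch (the item names and simulated pre-test data below are hypothetical, not part of the authors' protocol), Cronbach's alpha can be computed directly from a pre-test data set:

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of items scored in the same direction.

    items: one column per item, one row per respondent (complete cases only).
    """
    items = items.dropna()
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical pre-test: 120 respondents answering five 0-4 Likert-type items
# generated from a single latent trait plus noise.
rng = np.random.default_rng(42)
latent = rng.normal(size=120)
pretest = pd.DataFrame(
    {f"item{i}": np.clip(np.round(latent + rng.normal(scale=0.8, size=120)) + 2, 0, 4)
     for i in range(1, 6)}
)
print(f"Cronbach's alpha: {cronbach_alpha(pretest):.2f}")
```

Inter- and intra-observer agreement or test-retest reliability, also listed in Box 1, would be examined separately (e.g., with kappa statistics or intraclass correlation coefficients).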

ASSESSING INTERNAL STRUCTURE ADEQUACY

As already shown in the Figure , Box 2 expands the second phase of the development or adaptation of instruments: the structures to assess (configural, metric, and scalar); the properties under evaluation and the main questions to be answered; the models and analytical techniques used; as well as comments on what is expected of each property and how to evaluate it, including the demarcations guiding decisions.

Box 2. Internal structure: properties to evaluate in the psychometric phase.

Configural structure

Property: (Assumed) dimensionality.
Questions to be answered: Does the configural structure assumed in the first ("prototypic") phase arise? Can it be supported?
Model(s)/parameter(s): PCA, EFA/ESEM, CFA; preliminary eigenvalues, followed by the number of factors emerging in the factor analyses.
Comments: The dimensionality proposed in the previous phase is expected to be corroborated; otherwise, it is worth exploring alternative dimensional structures. From a preliminary PCA perspective, this can be observed through the number of eigenvalues greater than 1.0. When the ratio between the first and second eigenvalues is greater than four, some authors suggest the possibility of unidimensionality. Going further with CFA, the number of dimensions is evaluated through internal diagnostics suggesting poor configural specification (e.g., using Modification Indices and Expected Parameter Changes via Lagrange Multiplier tests). In an ESEM analysis, it is possible to observe directly alternative structures beyond those theoretically assumed.

Property: Theoretical relevance of items (theoretical-empirical congruence).
Questions to be answered: Do the items really belong in their respective dimensions, based on the results of the analysis?
Model(s)/parameter(s): EFA/ESEM and/or CFA; positioning or location of items in factors.
Comments: The items should express their respective factors, distinct from each other, as planned in the instrument development or adaptation process. If any item manifests dimensions other than those theoretically predicted, it must be revised.

Property: Factor specificity.
Questions to be answered: Is each item linked to only one dimension? Is there ambiguity?
Model(s)/parameter(s): EFA/ESEM and/or CFA; cross-loading items.
Comments: If an item has factorial specificity, its factor loading should not present ambiguity: the item is expected to be a unique expression of the factor it supposedly represents. Items violating this property should be identified and, depending on the situation, modified or even replaced.

Metric structure

Property: Reliability/discrimination of items.
Questions to be answered: What is the magnitude of the relationship between the items and the factors that underlie them?
Model(s)/parameter(s): EFA/ESEM and/or CFA/IRT; item loadings and residuals.
Comments: For an item to be considered reliable, its factor loading should be above a pre-specified demarcation. The literature does not stipulate a particular value; conventionally, 0.30, 0.35, or 0.40 are considered acceptable cut-off points to admit an item as reliable. Reliability is also tied to the notion of discriminability, since factor loadings are related to IRT parameters, which express the discrimination of an item. By plotting curves for different a parameters (which correspond to the loadings λ), it is possible to visualize them in the Item Characteristic Curve and then make a decision.

Property: Absence of redundancy of item content.
Questions to be answered: Do items overlap in such a way that they do not map the construct independently?
Model(s)/parameter(s): ESEM, CFA/IRT; residual correlation (implying violation of conditional/local independence).
Comments: In principle, items of a given factor are expected to show no residual correlations: they should be independent once conditioned on the factor they supposedly reflect. Violation of independence implies that the variability of the items has another common source in addition to the factor they represent. The magnitude of a residual correlation from which a violation of conditional independence can be inferred is somewhat arbitrary. One possibility is to choose a theoretically sustainable value (for example, 0.20 or 0.25) and statistically compare models with and without the estimated residual correlation. Another possibility is to follow authors' recommendations to guide the decision-making process: Reeve et al. suggest the simple demarcation of ≥ 0.3 to admit the existence of residual correlation. Some demarcations are based on formal statistics. One is the chi-square-based local dependence statistic (LD χ²) proposed by Chen and Thissen, which uses a cut-off point of ≥ 10 to indicate dependence; another is the Q3 statistic (and variants), as suggested by Yen. Several situations lead to correlation between item residuals (errors), but a common source in instrument development (or adaptation) is (partial) content redundancy between items (in general, pairs). Theoretical evaluation, observing the semantics and the denotative and connotative meanings of the respective contents, should be sought when a statistical violation is observed.

Property: Convergent factorial validity (CFV).
Questions to be answered: Do the items convergently reflect the corresponding factor?
Model(s)/parameter(s): CFA; average variance extracted (AVE).
Comments: CFV refers to each factor, as its name implies. CFV is understood to occur if the AVE of the items (i.e., the variance that the items have in common) is greater than the joint variance of the respective errors, which express item variability due to other sources. Thus, quantitatively, CFV is endorsed if the AVE is ≥ 0.5. From an interpretative perspective, endorsing CFV means accepting that the dimension (factor) in question is "well attended" by the respective set of items, since they contain more factor information than error (from sampling and/or measurement/process and/or inherent to the components). A related indicator summarizes the reliability of the construct (dimension): values ≥ 0.7 also indicate convergence and, strictly speaking, internal consistency (i.e., consistency of/between the items internal to the factor to which they belong).

Property: Discriminating factorial validity (DFV).
Questions to be answered: Is the amount of information captured by the set of items in their respective factors greater than that shared among the component factors?
Model(s)/parameter(s): CFA; contrast of the average variance extracted (by the items) of a given factor with the square of the correlations of this factor with the others in the system.
Comments: This property only applies to multidimensional constructs. If there is DFV, a larger information "flow" is expected from the factors to the items than between the factors themselves. Demarcation of a DFV violation may follow a generic rule of thumb or a more formal evaluation. Some authors suggest factorial correlations of 0.80 to < 0.85 as indicative of violation and ≥ 0.85 as a violation per se. A more rigorous strategy is to formally test the statistical significance of the difference between the AVE of the factor and the square of its correlations with the others. A positive and statistically significant difference would endorse DFV, while a statistically significant negative difference would favor its rejection, indicating violation. A nonsignificant positive or negative difference may be an indication for or against a violation; on a more conservative stance, a violation would be declared only on the basis of a statistically significant difference.

Scalar structure

Property: Coverage of latent trait information (by each item and by the set of items).
Questions to be answered: Does the item set cover most of the latent trait, or are there "unmapped" regions? In the latent trait regions effectively mapped, are the items evenly distributed, or are there clusters indicating redundancy?
Model(s)/parameter(s): Parametric IRT; visual inspection using the Wright map, which combines the construct map with estimates of item placement obtained in IRT analyses and graphical examination.
Comments: Items are expected to properly position individuals (or any other unit of analysis) along the construct map, and the spectrum of variation predicted by the construct map should be covered appropriately. One way to evaluate these two aspects is to critically assess the position of the items according to the proposed Wright map, considering the correspondence between item positioning along the latent spectrum (for example, via the b parameters obtained in IRT analyses) and the increasing intensity presented in the construct map. This visual procedure should be followed by an analysis of information coverage. Specific charts indicate whether the set of items covers most of the latent trait or whether there are regions with gaps (without items); they also help detect whether all latent trait regions are effectively covered, whether items are distributed evenly, or whether there are clusters indicating overlap and positioning/mapping redundancy. Additional graphical evaluations allow, in a complementary way, the behavior of the items to be assessed, especially regarding latent trait coverage. Obtained by parametric IRT, these graphs include the Item Information Functions and Item Characteristic Curves; when items are polytomous, Category Characteristic Curves are obtained. They also serve to evaluate the items "internally," observing the coverage areas of each level and whether these are ordered according to the theoretical assumption of the construct map. Examples of these graphs can be found in the references cited at the end of this box or in Internet searches (https://www.stata.com/manuals/irt.pdf).

Property: Ordering according to item scalability and monotonicity.
Questions to be answered: Do the items that map regions of the construct map do so in the theoretically expected order of intensity, or are there regions of the construct in which less severe (lighter/milder) items supplant items that, in principle, should be capturing more intense areas of the latent trait?
Model(s)/parameter(s): Nonparametric and parametric IRT; Loevinger's H, Mokken's criterion, and graphical assessments.
Comments: The items should clearly separate the regions of the latent trait (content) they supposedly cover, avoiding overlap as much as possible. Two strategies allow checking this property: ordering according to scalability, and monotonicity. Ordering items according to scalability refers to the coherence between the frequencies with which the items are endorsed and the part of the construct map they should cover. In an ideal scenario, a respondent with low intensity of a given latent trait (dimension) is expected to endorse an item mapping this "lower" intensity region, while not endorsing another item that reflects a more intense degree of the construct. This aspect can be analyzed per item and for the item set as a whole, and Loevinger's H coefficient reflects it. With 1.0 as the upper limit of adequacy, an estimate of at least 0.3 is recommended for the set of items; an H below this value indicates an instrument with poor scalability. According to Mokken, values of 0.3 to < 0.4 indicate weak scalability; 0.4 to < 0.5, average scalability; and ≥ 0.5, strong scalability. In an acceptable instrument, most of the per-item H estimates should also follow these references. The assumption of monotonicity is another related property to be appreciated when evaluating the scalar behavior of each item and, by extension, of the set formed by them. Monotonicity is supported when the probability of endorsing an item increases as the intensity of the latent trait increases. Visually, there is a violation of simple monotonicity when the probability of endorsement declines as the total (latent) score grows, and a violation of double monotonicity when there is any crossing of the item curves obtained in an IRT analysis. Whether single or double, monotonicity is supported when the criterion suggested by Mokken is < 40, understanding that some item crossings can be attributed to sampling variability; values between 40 and 80 serve as a warning, demanding a more detailed evaluation by the researchers; and a criterion higher than 80 raises doubts about the monotonicity of the item, as well as of the scale as a whole.

a Legend: PCA - principal component analysis; CFA - confirmatory factor analysis; EFA - exploratory factor analysis; ESEM - exploratory structural equation modeling; IRT - item response theory; CFV - convergent factorial validity; DFV - discriminating factorial validity; AVE - average variance extracted.

b References: Gorsuch 67 , Rummel 68 , Brown 17 , Kline 19 , Marsh et al. 48 , Embretson and Reise 62 , Bond and Fox 27 , De Boeck and Wilson 69 , Van der Linden 21 , Davidov et al. 30 Some of these references are occasionally marked, when necessary, along with other specific ones.
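
To make the configural heuristics in Box 2 concrete, the eigenvalue screens (number of eigenvalues above 1.0; a first-to-second eigenvalue ratio above four as a hint of unidimensionality) can be read off the item correlation matrix. The sketch below is illustrative only, with simulated item data; it complements, and does not replace, the CFA/ESEM diagnostics described above.

```python
import numpy as np
import pandas as pd

def eigenvalue_screen(items: pd.DataFrame) -> None:
    """Print PCA eigenvalues of the item correlation matrix and two common heuristics."""
    corr = items.corr().to_numpy()
    eig = np.sort(np.linalg.eigvalsh(corr))[::-1]   # eigenvalues, largest first
    print("Eigenvalues:", np.round(eig, 2))
    print("Eigenvalues > 1.0:", int((eig > 1.0).sum()))
    ratio = eig[0] / eig[1]
    print(f"First/second eigenvalue ratio: {ratio:.1f} "
          f"({'suggests' if ratio > 4 else 'does not suggest'} unidimensionality)")

# Hypothetical data: six items driven by a single latent trait plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 1))
items = pd.DataFrame(
    latent @ np.full((1, 6), 0.7) + rng.normal(scale=0.7, size=(300, 6)),
    columns=[f"item{i}" for i in range(1, 7)],
)
eigenvalue_screen(items)
```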

Box 2 highlights how many properties need to be scrutinized before judging the internal structure as adequate, thus endorsing this validity component of the instrument 15 , 16 . This is at odds with the general literature on the topic, in which the validity of an instrument tends to be accepted on the basis of somewhat sparse and weak evidence. Quite often, decisions on the acceptability of the instrument rely on a few factor analyses, using only model fit indices demarcated by generic cut-off points (e.g., Root Mean Square Error of Approximation/RMSEA, Comparative Fit Index/CFI, Tucker-Lewis Index/TLI 17 ). These analyses usually fall short of examining the items and the scale(s) as a whole in greater depth. Strictly speaking, the range of properties listed in Box 2 does not fit into a single product (e.g., one scientific article), and serial studies are often necessary to visit one or more properties at a time. The methodological intricacies relating to each property certainly require detailing and greater editorial space.
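
For readers who want to reproduce that first, fit-index step in code, the sketch below assumes the third-party semopy package and a lavaan-style model string; the factor name, item names, and simulated data are hypothetical. As the paragraph above stresses, such global indices are only a starting point and do not substitute for the item-level diagnostics in Box 2.

```python
# Assumes the semopy package (pip install semopy); model and data are hypothetical.
import numpy as np
import pandas as pd
import semopy

# Simulated data: five items reflecting one hypothetical factor.
rng = np.random.default_rng(11)
latent = rng.normal(size=(300, 1))
data = pd.DataFrame(
    latent @ np.full((1, 5), 0.7) + rng.normal(scale=0.7, size=(300, 5)),
    columns=[f"item{i}" for i in range(1, 6)],
)

# One-factor CFA specification in lavaan-style syntax.
model_desc = "wellbeing =~ item1 + item2 + item3 + item4 + item5"

model = semopy.Model(model_desc)
model.fit(data)

print(model.inspect())           # loadings and residual variances
print(semopy.calc_stats(model))  # fit statistics, including RMSEA, CFI, and TLI
```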

A previously addressed point illustrates this fundamental rigor: the need for explicit demarcations to decide whether an item or scale meets the property under scrutiny. All estimators used in the evaluations require specific cut-off points, so that choices can be replicated or, when appropriate, criticized, rejected, or modified during the development or adaptation of an instrument. Box 2 offers some landmarks indicated in the literature. Beyond prescriptive benchmarks, these should serve as a stimulus to the empirical examination of an instrument. The main point is that the many decisions related to the psychometric suitability of an instrument need clear anchors, previously agreed upon by peers of the scientific community. The literature would certainly be enriched if these details extended to scientific articles.
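
As one way of making such demarcations explicit and reproducible, the convergent and discriminant thresholds cited in Box 2 (AVE ≥ 0.5; a related reliability indicator ≥ 0.7; AVE of each factor exceeding the squared inter-factor correlation) can be computed from standardized loadings with the usual Fornell-Larcker-style formulas. The loadings and factor correlation below are hypothetical, and composite reliability is used here only as one common choice for the 0.7 demarcation.

```python
import numpy as np

def ave(loadings: np.ndarray) -> float:
    """Average variance extracted from standardized factor loadings."""
    return (loadings ** 2).mean()

def composite_reliability(loadings: np.ndarray) -> float:
    """Composite (construct) reliability from standardized loadings."""
    errors = 1 - loadings ** 2                 # residual variances under standardization
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + errors.sum())

# Hypothetical standardized loadings of five items on one factor.
lam = np.array([0.72, 0.65, 0.80, 0.58, 0.70])
print(f"AVE = {ave(lam):.2f} (>= 0.5 endorses convergent factorial validity)")
print(f"Composite reliability = {composite_reliability(lam):.2f} (0.7 is the usual demarcation)")

# Discriminating factorial validity check for two factors: the AVE of each
# factor should exceed the squared correlation between the factors.
ave_f1, ave_f2, phi = 0.55, 0.62, 0.64   # hypothetical values
print("DFV supported:", ave_f1 > phi ** 2 and ave_f2 > phi ** 2)
```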

One point to make regarding the procedural context in question is that multivariate analyses are used as diagnostic devices. As process tools, they must answer the central questions posited a priori . In this sense, it is necessary to distinguish the eminently qualitative issues from the quantitative, technical, and methodological ones. The third configural property presented in Box 2 serves as an example. Rather than simply verifying whether an exploratory factor analysis identifies cross-loadings, it is important to determine whether a violation of factorial specificity effectively occurs, which would be antithetical to what was projected in the first phase of instrument development. A cross-loading suggests ambiguity in the item, and therefore that its clear-cut function as an "empirical representative" of the dimensional construct map was not fulfilled. Here, quantitative evidence meets qualitative findings, signaling a problem and the need for action, either by modifying the item semantics or by replacing the item with one that has better properties. The other properties demand the same approach.
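
A small script can keep this kind of quantitative screen connected to the qualitative judgment, for example by flagging items whose secondary loading is also salient in an EFA solution. The loading matrix below is hypothetical, and the 0.30 cut-off is just one of the conventional demarcations mentioned in Box 2.

```python
import pandas as pd

# Hypothetical EFA loading matrix: rows are items, columns are factors.
loadings = pd.DataFrame(
    {"F1": [0.71, 0.66, 0.45, 0.08, 0.12],
     "F2": [0.05, 0.10, 0.38, 0.74, 0.69]},
    index=[f"item{i}" for i in range(1, 6)],
)

threshold = 0.30  # conventional cut-off for a salient loading
for item, row in loadings.iterrows():
    salient = row[row.abs() >= threshold]
    if len(salient) > 1:
        # The flag is only a prompt: the item's content and semantics still
        # need theoretical review before modifying or replacing it.
        print(f"{item}: cross-loads on {list(salient.index)} -> review content/semantics")
```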

In addition to the internal properties of items and scales summarized in Box 2 , two other related questions deserve mentioning. The first concerns the presumption of measurement invariance (configural, metric, and scalar) 17 . The assumption that the instrument performs similarly in different population subgroups is almost a rule. Often, it is tacitly assumed that the instrument functions equally well across groups (e.g., genders, age groups, strata with different levels of education or residing in different parts of the country), so that any differences observed between them are considered factual and not due to measurement problems. However, without further evidence, this is a difficult argument to sustain since inconsistent functioning of an instrument among subgroups of the population can lead to incorrect inferences and inefficient or even harmful health decisions and actions 20 . This demands stepping up research programs on measurement instruments. Beyond scrutinizing their properties, evaluating them in various population segments is also needed. To ensure invariance of the instrument in different population subgroups is to allow reliable comparisons.
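
Formal invariance testing requires multi-group factor models, with configural, metric, and scalar constraints tested in sequence. As a rough first look only, and not a substitute for those tests, one can compare one-factor loadings estimated separately in two subgroups; the groups, loadings, and data below are simulated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(7)

def simulate_group(n: int, loadings: np.ndarray) -> pd.DataFrame:
    """Simulate item responses from a one-factor model with the given loadings."""
    latent = rng.normal(size=(n, 1))
    noise = rng.normal(scale=0.7, size=(n, len(loadings)))
    return pd.DataFrame(latent @ loadings[None, :] + noise,
                        columns=[f"item{i}" for i in range(1, len(loadings) + 1)])

# Hypothetical scenario: item4 is weaker in group B, mimicking non-invariance.
group_a = simulate_group(400, np.array([0.80, 0.70, 0.75, 0.70, 0.65]))
group_b = simulate_group(400, np.array([0.80, 0.70, 0.75, 0.30, 0.65]))

def one_factor_loadings(df: pd.DataFrame) -> np.ndarray:
    fa = FactorAnalysis(n_components=1).fit(df)
    return fa.components_.ravel()

# Note: the sign of estimated loadings is arbitrary; compare absolute values.
print("Group A loadings:", np.round(np.abs(one_factor_loadings(group_a)), 2))
print("Group B loadings:", np.round(np.abs(one_factor_loadings(group_b)), 2))
```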

Along with invariance is the issue of equalization and linking of instruments 22 . These concern the search for common metrics across instruments that supposedly capture the same construct, but hold different items and/or varied response options 25 , 26 . In both cases, one must be careful when summarizing and comparing studies. Study results may not be comparable—even if focused on the same construct—when they are conducted in different populations and with different instruments. Without equalization, measurement instruments may lack metric and scalar tuning.

An issue related to the scalar properties of an instrument concerns the appropriateness of grouping individuals when applying cut-off points to scores (whether crude scores, formed by the sum of item scores, or model-based scores, such as factor-based or Rasch scores 27 , 28 ). This point deserves attention, especially regarding the approaches frequently used in epidemiology. It is common to categorize a score into a few groups, by taking the mean, median, or some other “statistically interesting” parameter as a cut-off point. This procedure has downsides, however, since the study population is not necessarily partitioned into internally homogeneous and externally heterogeneous groups. Substantive knowledge on the subject matter is undoubtedly crucial in the process of grouping respondents appropriately, but the search for internally similar yet comparatively distinct groups may gain from using model-based approaches, such as latent class analyses or finite mixture models 29 .
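
As a sketch of the model-based alternative to mean- or median-based splits mentioned here, a finite mixture fitted to the scale scores can be compared across numbers of groups using the BIC. The scores are simulated, and a Gaussian mixture is only one convenient choice; latent class models would be the analogue for categorical items.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Hypothetical scale scores drawn from two latent groups of respondents.
scores = np.concatenate([rng.normal(10, 2, 300), rng.normal(20, 3, 150)]).reshape(-1, 1)

bics = {}
for k in (1, 2, 3, 4):
    gm = GaussianMixture(n_components=k, random_state=0).fit(scores)
    bics[k] = gm.bic(scores)

best_k = min(bics, key=bics.get)   # lower BIC indicates better fit
print("BIC by number of groups:", {k: round(v, 1) for k, v in bics.items()})
print("Best-fitting number of groups:", best_k)

labels = GaussianMixture(n_components=best_k, random_state=0).fit_predict(scores)
print("Group sizes:", np.bincount(labels))
```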

ASSESSING CONNECTIONS BETWEEN CONSTRUCTS AND THEORIES

Box 3 proposes a typology that is in line with validity based on hypothesis testing presented in the early 2010s by the COSMIN initiative (COnsensus-based Standards for the selection of health Measurement INstruments) 15 , 16 , 33 . Contrary to the apparent conciseness of the suggested typology, this stage of the second phase of instrument evaluation implies a long process—perhaps as long as the study of the construct itself in all its relationships of causes and effects. Evoking other texts 7 , 11 , it is important to point out that the validity of an instrument ultimately corresponds to establishing validity of the theoretical construct that the instrument aims to measure. Somewhat circular and dismaying due to the long road it projects, this reasoning alerts us to how risky and reckless it is to support an instrument that has been assessed by only a few studies. Consolidating and eventually endorsing the suitability of an instrument requires many tests, both regarding its internal structure and its external connections.

Box 3. Stages of the evaluation of connections between the instrument and other constructs and theories.

Stage 1. Evaluation of relationships between the (sub)scales of the instrument.
Questions to be answered: Are the (sub)scales of the instrument associated in the expected direction and magnitude?
Technique/method/model: Parametric or nonparametric association tests between the (sub)scales of the instrument.
Comments: This aspect may already have been contemplated in the assessment of discriminant validity involving factorial correlations, in the stage of evaluation of the internal structure. At this point of the analysis, however, the tests are based on the scale scores themselves (whether crude or model-based), refined in previous stages, mainly regarding the scalar structure.

Stage 2. Evaluation of relationships between the (sub)scales and other instruments of the same construct that are not considered a reference.
Questions to be answered: Does the instrument associate with another one that measures the same construct in a similar (convergent) way? At what magnitude?
Technique/method/model: Comparison of extreme groups and parametric or nonparametric association tests.
Comments: This stage concerns construct validity. Together, construct, content, and criterion validity are known as the "three Cs" described in many textbooks on classical measurement theory.

Stage 3. Evaluation of relationships between the (sub)scales and another instrument (or procedure) considered a reference for the construct itself.
Questions to be answered: Is the instrument capable of measuring what it proposes when another instrument is regarded as a reference?
Technique/method/model: Estimation of sensitivity, specificity, and area under the ROC (receiver operating characteristic) curve of the instrument, based on a concurrent criterion (reference instrument) and/or a predicted (future) outcome.
Comments: The literature traditionally calls this stage criterion validity (one of the three Cs), subdivided into concurrent and predictive validity.

Stage 4. Evaluation of relationships between the (sub)scales and others outside the construct in question.
Questions to be answered: Does the instrument confirm the general predictions and hypotheses of the theory that involves it, i.e., its nomological network? Is the instrument unrelated to other constructs that are not part of the general theory encompassing the phenomenon of interest?
Technique/method/model: Multivariate data analysis, complex causal models, and other statistical techniques that allow the relationships of interest to be analyzed with greater rigor and accuracy.

a References: Streiner et al. 7 , Bastos et al. 3 , Reichenheim and Moraes 6 , Lissitz 70 , Armitage et al. 71 , Corder and Foreman 72 , Kline 19 , Little 61 , Hernán and Robins 5 , VanderWeele 35 .
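
For the criterion-validity stage in Box 3, sensitivity, specificity, and the area under the ROC curve can be estimated against a reference classification. The sketch below uses simulated scores and a hypothetical reference standard, with scikit-learn's standard metrics; the Youden index is used only as one illustrative way of picking a cut-off.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(5)
# Hypothetical data: scale scores and a binary reference ("gold standard") classification.
n = 500
reference = rng.integers(0, 2, size=n)
scores = rng.normal(loc=10 + 5 * reference, scale=3)   # cases tend to score higher

auc = roc_auc_score(reference, scores)
fpr, tpr, thresholds = roc_curve(reference, scores)

# Sensitivity and specificity at an illustrative cut-off chosen by the Youden index.
youden = tpr - fpr
best = np.argmax(youden)
print(f"AUC = {auc:.2f}")
print(f"Cut-off = {thresholds[best]:.1f}: sensitivity = {tpr[best]:.2f}, "
      f"specificity = {1 - fpr[best]:.2f}")
```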

As suggested in Box 3 , external validation of an instrument ranges from simple tests of association between component subscales to intricate hypothesis tests about the construct, what scholars often take as the nomological network of interconnected concepts of a theory 5 , 7 , 34 , 35 . Whatever the level of complexity of the study, a question that arises, often in the context of scientific publications, is when an external validity study should be performed, given all the prior steps needed to understand the intricacies of the instrument. Would it be worth conducting studies along the lines that Box 3 indicates without first having some evidence about the sustainability of the instrument's configural, metric, and scalar structures? One should recognize that correlations between scales (e.g., the instrument in question and others that cover the same construct) may well occur even in the face of multiple psychometric insufficiencies at the internal level. What would these correlations mean, knowing, for example, that the set of items does not satisfactorily meet the requirements of factorial specificity, convergent factorial validity, and scalability? An answer based on the mere correlation would indicate external validity, but one could ask "of what?" if the ability to represent the underlying construct is flawed and uninformative. These questions cannot be answered definitively, but it is necessary to pose them before "blindly" carrying out external validity studies. The timing of these stages is a decision to be taken within each research program, but the saying "haste makes waste" serves as a reminder: too little time and effort (and resources!) invested in one step can mean double the time and effort (and resources!) needed in a following step.

CONCLUDING REMARKS

This article makes clear that the development of a measurement instrument involves an extensive process comprising multiple connected studies. This trajectory can be even longer and more tortuous when replication studies are needed or when certain psychometric studies raise fundamental questions that can only be answered by returning to the prototypic phase of development. This panorama contrasts sharply with the way epidemiologists often approach measurement instruments. Contrary to common practice, evidence on the adequacy of a measurement tool demands more than one or two studies on its dimensional structure or on the magnitude of factor loadings. This warning also extends to critical analyses of external validity which, as mentioned in the previous section, require attention to the inner workings of the instrument.

The development and refinement of different versions of the instrument are also vital, so that research carried out in distinct populations remains comparable. The cross-cultural adaptation process is as intricate as the development of a new instrument, and all phases and stages apply equally to adaptation processes. In fact, a researcher performing a cross-cultural adaptation often finds a variety of gaps in the original research program that gave rise to the instrument. Sometimes these are problems related to the execution of previous studies; other (many) times, several properties have not even been assessed. In this case, the focus shifts from equivalence (see the section on research scenarios) to the core of the structure of the instrument. This is not trivial, since the origin of the problem is always in doubt: is it intrinsic to the instrument, or does it lie in the process of cross-cultural adaptation? Be that as it may, examining an instrument in another sociocultural context requires even more time and effort. That is why many consider cross-cultural adaptation an additional construct validation step 33 .

A recurring question is whether all phases and stages need to be completed to deem an instrument suitable for research or for use within health services. This question is difficult to answer, but some milestones may guide us. One suggestion has already been offered in the section on the process stages: a well-planned and developed prototypic phase helps greatly to obtain favorable results in the second major phase of the process. Rigor in the first phase contributes to better psychometric properties; it also adds efficiency, as several problems tend to be solved or even avoided early on. Epidemiological studies in the psychometric phase are usually large and, therefore, rarely susceptible to replications to solve emerging issues.

Another guide is resorting to the fundamentals: always remembering the essence of each property and what its violation means. For example, would we firmly declare an instrument valid and ready for use in light of a few exploratory factor analyses (preliminarily establishing a configural structure) and/or some studies correlating the score(s) of the (sub)scale(s) with certain sociodemographic variables as evidence of theoretical pertinence? Given the range of substantive and procedural possibilities, would this be sufficient, or should we postpone the use of the instrument and obtain additional evidence to support its validity? We reiterate that a quick and ready answer does not exist, but a rule of thumb may be useful for decision-making: even if we are not willing to let the excellent get in the way of the good, or even the good get in the way of the reasonable, it may be worth letting the reasonable get in the way of the bad. Although this is a subjective perspective, always negotiable among peers, putting it into practice will likely lead us to better instruments and, as we have already pointed out, to better results and comparisons between studies and health interventions.

The continuous development, refinement, and adaptation of measurement instruments should be an integral part of epidemiological research. Knowledge construction requires instruments with acceptable levels of validity and reliability, on par with the level of rigor commonly demanded in the elaboration of study designs and their complex analyses. Meticulousness and rigor in those spheres are pointless if the dialogue across publications and the appraisal of consistent scientific evidence fail because of precarious measurement instruments. As products intended for collective use, measurement instruments require development processes that resemble those of medicines and other health technologies. As such, they deserve care and dedication.

Funding Statement

Funding: MER was funded in part by the Conselho Nacional de Desenvolvimento Científico e Tecnológico do Brasil (CNPq - Process 301381/2017-8). JLB was funded in part by the Conselho Nacional de Desenvolvimento Científico e Tecnológico do Brasil (CNPq - Process 304503/2018-5).

Reichenheim M, Bastos JL. What, what for and how? Developing measurement instruments in epidemiology. Rev Saude Publica. 2021;55:40.


Na perspectiva absolutista, desconsideram-se nuances socioculturais na interpretação dos eventos de interesse e assume-se, portanto, que há possibilidade de comparação irrestrita de aferições quantitativas levadas a cabo em diferentes populações. Neste caso, salvo a necessidade de tradução literal para os idiomas pertinentes, um único instrumento de aferição poderia ser amplamente utilizado nas mais variadas populações, podendo-se comparar diretamente os seus resultados com vistas à consolidação do conhecimento científico sobre o objeto de interesse.

Em situação diametralmente oposta encontra-se a abordagem relativista, a qual eleva as especificidades socioculturais à sua máxima importância, pressupondo que um instrumento de aferição diferente seja utilizado para cada nova população investigada. Essa abordagem nega a possibilidade de comparação quantitativa de medidas realizadas em populações socioculturalmente diferenciadas, visto que os instrumentos não seriam equivalentes entre si, e a única forma de contrastá-las seria por meio de análises qualitativas.

Assumindo uma posição intermediária está a perspectiva universalista, que implica tanto a aferição quantitativa de fenômenos em investigação quanto a possibilidade (mas, não garantia) de comparação entre populações distintas por meio dessa medida. Essa posição reconhece as nuances socioculturais e apregoa que devem ser levadas em consideração. Havendo semelhança na forma como os eventos são interpretados em diferentes populações, seria possível prosseguir com a utilização de um instrumento dito “universal”, mas adaptado para cada situação particular. Nessa visão, a ATC garantiria a equivalência entre as suas diversas versões 10 . Sua aplicação permitiria que populações socioculturalmente distintas fossem comparadas quantitativamente a partir de medidas equivalentes do mesmo problema de interesse.

A abordagem universalista 3 , 6 , 11 , 12 implica três possíveis cenários, que devem ser avaliados e identificados pelo pesquisador para escolher o instrumental de pesquisa, respondendo se:

  • existe um instrumento consagrado e adaptado para uso em distintas populações, incluindo a de interesse (Cenário 1);
  • há instrumento disponível, mas seu uso requer cautela ou refinamento adicional, dada sua ainda limitada aplicabilidade à população em tela, seja por necessitar de avaliações psicométricas complementares ou porque ainda precisa ser submetido a um processo de ATC (Cenário 2); ou
  • inexiste instrumento, sendo necessário propor o desenvolvimento de um inteiramente novo (Cenário 3).

No Cenário 2, frequentemente há necessidade de desenvolver estudos de ATC, nos quais o conceito de equivalência deve ser tomado como norte 10 . A equivalência é usualmente desdobrada em conceitual, de itens, semântica, operacional e de mensuração, as quais requerem uma avaliação meticulosa para que se considere um instrumento plenamente adaptado. No Cenário 3, por sua vez, o pesquisador deve suspender a iniciativa original de pesquisa e propor o desenvolvimento de instrumental completamente original 13 . Aqui, é preciso empreender um programa de investigação paralelo que vise gerar um instrumento capaz de produzir as medidas de interesse. Isso é crucial, uma vez que seguir com a pesquisa sem bons instrumentos de aferição põe todo o projeto a perder, diminuindo suas chances de contribuir com o avanço do conhecimento ou de atender a uma necessidade de saúde, tornando-se, assim, eticamente condenável. Na maioria das vezes, os estudos epidemiológicos são conduzidos nos limites dos Cenários 2 e 3, o primeiro sendo o mais comum e afeito ao contexto brasileiro de pesquisa.

Uma implicação importante de trabalhar em meio a esses cenários é a necessidade de conhecer detalhadamente o estado da arte do desenvolvimento dos instrumentos disponíveis. Tal conhecimento é imprescindível para proceder tanto a uma ATC e/ou ao refinamento de instrumentos de aferição pré-existentes quanto para o desenvolvimento de novo instrumental e a subsequente condução da pesquisa epidemiológica de fundo.

FASES A PERCORRER NO PROCESSO DE DESENVOLVIMENTO OU ADAPTAÇÃO DE INSTRUMENTOS DE AFERIÇÃO

Seja no caso da proposição de um novo instrumento ou de uma ATC, vislumbram-se etapas processuais distintas, ainda que complementares e interativas. A Figura esquematiza um modelo processual a ser adotado.

An external file that holds a picture, illustration, etc.
Object name is 1518-8787-rsp-55-40-gf01-pt.jpg

A primeira fase visa a elaboração e o detalhamento conceitual do construto a ser medido; a especificação, a confecção e o refinamento dos itens quanto aos seus conteúdos empíricos e semânticos; a pormenorização dos aspectos operacionais, incluindo os cenários de aplicação admissíveis para o instrumento; e várias jornadas de pré-testes para alcançar sintonia fina, como melhorias na redação e na compreensão da população-alvo quanto aos itens. Provisoriamente denominada “prototípica” por encerrar as etapas de construção de um ou vários esboços do instrumento (i.e., protótipos ou versões preliminares) a serem subsequentemente testados, esta primeira fase do processo é essencial para os bons frutos finais. Se, na perspectiva do desenvolvimento de um novo instrumento, esse passo é claramente imperativo, ele não é menos importante em ATC, em que a noção de equivalência (referida na seção anterior) exige um exame minucioso. Este ponto requer ênfase, uma vez que os esforços dedicados às suas fases constituintes são frequentemente parcos nos processos de ATC, quando não totalmente ignorados.

Além de sua central relevância para o alcance de um instrumento funcional, apuro nesta primeira fase não é somente importante do ponto de vista substantivo – na busca de correspondência entre o construto a ser aferido e a ferramenta para sua mensuração –, mas também torna mais eficiente a fase seguinte, de teste dos protótipos. Tempo empenhado e rigor procedimental nesta fase diminuem a possibilidade de se encontrar impropriedades nos estudos de validação subsequentes, que são geralmente de grande porte e, portanto, bastante dispendiosos. O pior cenário é encontrar deficiências marcantes no final de um longo e intrincado processo envolvendo múltiplos estudos encadeados ( cf . próxima seção), ter de voltar a fases anteriores de desenvolvimento e retornar ao campo para testar um protótipo praticamente novo.

O protótipo especificado na fase anterior é, então, examinado em uma segunda grande fase, que, também provisoriamente, poderia ser cunhada de “psicométrica”. Diferentemente da primeira, em que preponderam abordagens qualitativas, esta segunda fase, como já aventado, encerra uma sequência de estudos quantitativos de maior porte. Expandindo a parte superior direita da Figura , a porção inferior mostra os diversos aspectos psicométricos que compõem essa fase. Há dois segmentos distintos: um concerne à validade da estrutura interna do instrumento, cobrindo o exame de suas estruturas configural, métricas e escalares; outro aborda sua validade externa, permitindo verificar se o seu comportamento – relativo a medidas de outros construtos, por exemplo, – está de acordo com o que teoricamente se espera.

Antes de passar para o detalhamento do modelo processual proposto, vale uma ressalva sobre os tipos de instrumentos aos quais a Figura se refere. Como deverá ficar claro ao longo do texto, o modelo que propomos envolve, principalmente, construtos (dimensões) em que o objeto em tela, por pressuposto, se intensifica ou remite em gradiente. Ainda que estes tipos de construtos sejam muito frequentes – e.g., doenças e agravos (e.g., depressão), eventos psicossociais (e.g., violências), percepções de saúde ou de qualidade de vida –, há situações em que este (de)crescente de gravidade ou intensidade não se aplica ou não importa tanto.

Um bom exemplo está no que poderíamos chamar de “inventários”, como seria um questionário para investigar se um indivíduo já se expôs a algum agente químico. Aqui, o instrumento deveria conter uma gama de perguntas sobre situações de contato potencial ao longo de um determinado período, sendo o endosso a ao menos uma destas situações a própria positivação do respondente. Ainda que se possa pensar em um segundo instrumento para captar o grau de exposição a este agente químico – e que, portanto, estaria aferindo a intensidade crescente dessa exposição –, para os fins propostos o questionário em tela prescindiria dessa qualidade. Outra situação à qual o modelo na Figura não se aplicaria se refere a instrumentos pragmáticos, baseados em um conjunto de variáveis preditoras de risco que, no entanto, não estariam vinculadas teoricamente a um construto. Um exemplo seria uma ferramenta de predição de risco de letalidade da covid-19 a ser usada ao primeiro contato de um paciente com o serviço de saúde, composta por variáveis que cobrissem diferentes ângulos, como o sociodemográfico, hábitos de vida, condição patológica pregressa, prática preventiva, trajetória e contatos recentes ou, ainda, exames de entrada. Mesmo que claramente de extrema valia, não haveria um construto definido a ser mapeado por esse conjunto.

Há, por certo, muitas outras situações em que itens componentes de um instrumento não se conectam teoricamente e/ou formam um explícito gradiente de intensidade. Compete, pois, ao pesquisador discerni-las e avaliar se um modelo processual como o proposto aqui é pertinente. Alguns pormenores sobre suas duas fases são oferecidos nas três seções seguintes. Vale apontar, no entanto, que as possibilidades são amplas e que, assim sendo, nossa escolha é forçosamente uma de muitas. Ao leitor interessado sugerimos recorrer à bibliografia conexa, da qual oferecemos alguns artigos e livros de interesse no texto que segue.

ASPECTOS A AVALIAR NA PROPOSIÇÃO DE UM INSTRUMENTO

Os detalhes da primeira fase do modelo processual ilustrada na Figura estão no Quadro 1 . Adaptado da proposta de Wilson 13 , o processo encerra cinco etapas distintas. Na primeira, avalia-se a teoria que embasa o construto com vistas à representação do que se quer medir. Essa representação é denominada tecnicamente de “mapa do construto”, a qual delineia as ideias que os desenvolvedores (ou, dependendo, os adaptadores) do instrumento têm sobre o que está por ser captado, incluindo seu gradiente de intensidade 13 . É a partir do mapa do construto que se alicerça a busca dos itens para representá-lo. De muitos possíveis, a proposta é se chegar a um conjunto eficiente e efetivo de itens que contenham boas propriedades de mensuração. A meta é, ao fim do processo, identificar aqueles que, de forma mais discriminante e escalonada possível, consigam mapear o espaço métrico do construto. Constituído de itens posicionados no esperado gradiente crescente de intensidade, o mapa de Wright 13 é a expressão empírica do mapa do construto.

Etapa de avaliaçãoDescrição e propósitoQuestões a serem respondidasTécnica/método/modeloExpressão empírica
Avaliação da teoria que embasa o construtoApreciação teórica do construto que se deseja aferir, tanto em relação a uma potencial multidimensionalidade quanto ao gradiente de intensidade nestas dimensões. Esta é a etapa de desenvolvimento do mapa do construto (dimensional) .Qual é a definição do construto de interesse? Há subdimensões postuladas para o construto? Quais são elas? Quais seriam os elementos teóricos dessa(s) dimensão(ões) e como se organizariam em um crescente (gradiente) de intensidade?Revisão de literatura. Consulta a especialistas.Não há expressão empírica deste aspecto, visto que a definição do construto, seu gradiente de intensidade e suas possíveis subdimensões são questões fundamentalmente teóricas.
Avaliação de conteúdo de itensIdentificação das manifestações empíricas dos componentes da(s) dimensão(ões) e de como estes manifestos cobrem porções do mapa do construto. Nesta etapa se propõe uma validade de conteúdo (também conhecido como validade de face), conectando-se a expressão empírica aos elementos teóricos subjacentes.Os itens do instrumento têm conteúdos próprios e vinculados à dimensão subjacente? Os itens são distintos entre si em termos de conteúdo? Cada porção do mapa do construto está representada por itens específicos? Em seu conjunto, os itens cobrem suficiente e adequadamente o mapa do construto (i.e., sem deixar lacunas e/ou ocupar posição semelhante à de outros itens)?Revisão de literatura. Consulta a especialistas. Abordagens qualitativas com membros da população-alvo (entrevistas em profundidade, grupos focais etc.) .Individualmente, cada item reflete uma porção específica do mapa do construto. Em conjunto, os itens devem cobrir suficiente e adequadamente o conteúdo do construto subjacente (ou, no caso de este ser multidimensional, cada dimensão constituinte).
Especificação da semântica de itensRedação dos itens com vistas a maximizar a transmissão de seus conteúdos ao respondente.Os termos empregados na redação de cada item permitem sua vinculação direta e inequívoca a porções específicas do mapa do construto?Consulta a especialistas em linguística e no objeto de pesquisa em questão, bem como a tradutores (no caso de ATC) .Itens do instrumento e sua redação específica.
Avaliação de aspectos operacionaisApreciar e decidir sobre o modo de aplicação do instrumento – por exemplo, face a face, autopreenchimento, eletrônica etc. – na população-alvo, o que inclui avaliar a adequação do cenário de administração. Nesta fase do processo se dá início também a uma avaliação sobre a contribuição de cada item ao mapa do construto, discutindo-se níveis/categorias do desfecho.Qual é o modo de aplicação mais adequado, considerando a população-alvo de interesse? Em que cenário operacional o instrumento deve ser administrado?Consulta a especialistas e a membros da população-alvo via estudos qualitativos .Modo(s) de aplicação do instrumento no cenário operacional desejável para sua utilização. Qualquer instrumento deve ser avaliado à luz de um cenário operacional pré-estabelecido, de preferência já no seu processo de desenvolvimento (ou ATC).
Pré-testes (incluindo testes preliminares de confiabilidade)Estudos de médio porte (e.g., n = 100–150) visando avaliar: Aceitação, compreensão e impacto emocional dos itens. Aspectos formais relativos à sequência dos itens ou regras de pulos. Opções de resposta do instrumento, o contexto de aplicação (aspectos operacionais). Esta etapa também pode ser utilizada para análises preliminares de confiabilidade, focalizando consistência interna, concordância inter e intraobservador/teste-reteste etc.O instrumento apresenta grau aceitável de compreensão? As reações que os itens provocam nos respondentes estão dentro do esperado? A sequência e o encadeamento dos itens contribuem para uma fácil administração para os aplicadores e/ou respondentes? As opções de resposta estão sintonizadas com a capacidade de discernimento dos respondentes? O contexto de aplicação favorece a interação entre instrumento e respondente ou entrevistador e respondente? Há indícios de boa confiabilidade nos estudos preliminares ( pré-teste)?Aplicação do instrumento na população-alvo, incluindo, possivelmente, formulações alternativas dos itens. Deve-se executar uma sequência de estudos (jornadas) até se obter um ou mais protótipos a serem utilizados na segunda fase do processo de desenvolvimento (ou ATC) do instrumento .Registros da experiência de aplicação do instrumento a membros da população-alvo. Indicadores de confiabilidade (as demarcações de aceitabilidade diferem segundo o tipo). Consulte Reichenheim et al. para maiores detalhes. Consultar também Streiner et al. , Nunnally e Bernstein , Raykov e Marcoulides , Price e Shavelson e Webb .

* Referências de fundamentação: Streiner et al. 7 , Beatty et al. 42 , Moser e Kalton 43 , Bastos et al. 3 , Reichenheim e Moraes 6 , Johnson e Morgan 44 , DeVellis 45 ; Gorenstein et al. 46 Algumas destas referências são também assinaladas ocasionalmente, quando necessário, juntamente com outras específicas.

É preciso entender que o processo de translado do plano teórico-conceitual ao empírico requer contextualização e, assim, uma boa compreensão sobre a população-alvo na qual se intenciona empregar o instrumento em desenvolvimento ou ATC. Por um lado, o construto (e o que este representa no âmbito da teoria subjacente) pressupõe pertinência ao domínio populacional em tela. Por outro, é necessário que os itens elegíveis tenham potencialidade de endosso no contexto previsto. Cabe sempre perguntar se a resposta a um item tem como se realizar e se uma potencial negativação não advém de uma impossibilidade intrínseca. Como exemplo podemos citar um item sobre discriminação explícita vivenciada no domínio laboral perguntado a escolares que ainda não alcançaram o mercado de trabalho. Ainda que um tanto óbvio quando destacado, trata-se de um problema bastante comum e que requer atenção constante.

Uma vez especificado o mapa do construto, passa-se para a identificação e confecção dos itens que comporão o instrumento. É nessa etapa que os pesquisadores deverão identificar as variadas formas pelas quais o construto se manifesta, incluindo suas diferentes intensidades 13 . De fato, o Quadro 1 distingue o processo de identificação de itens da elaboração de como estes serão transmitidos aos respondentes. São, efetivamente, afazeres distintos. O processo de identificação de potenciais itens deriva diretamente do mapa do construto, tendo a ver com o reconhecimento das manifestações empíricas que representam o gradiente de intensidade esboçado. Diz respeito ao conteúdo (significado) de cada item e não à sua forma (redação). Questões sintáticas e semânticas vêm depois (terceira etapa), quando já se tem um número mais restrito de itens candidatos, selecionados por meio de estudos qualitativos sequenciais 3 , 6 .

A quarta etapa da primeira fase concerne às questões operacionais, a começar pela especificação do espaço de desfecho de cada item. Identificar o tipo e número de categorias de resposta que cada item deve conter não é algo secundário. Como outras questões eminentemente operativas – formato do instrumento, mídia de veiculação, cenário de aplicação etc. –, debater e especificar o espaço de desfecho dos itens é algo a ser visto precocemente, tão logo seja identificado o público-alvo para o qual o instrumento será direcionado. É com esse foco que, subsequentemente, é preciso retomar a terceira etapa, redigindo-se as qualificações das categorias de resposta antes acordadas no âmbito do conteúdo.

Nesse ponto, vale a pena sublinhar que a validade de um instrumento – sua adequação e seu desempenho – não ocorre em um vazio, mas depende de estreita sintonia com o conteúdo de fundo, da atenção à capacidade cognitiva e emocional dos respondentes e de um ambiente profícuo no qual respostas podem ser oferecidas com ética, espontaneidade e segurança. É preciso lembrar que um instrumento já muitas vezes validado pode ter um desempenho aquém do esperado se for aplicado a uma população para o qual não foi originalmente confeccionado ou em um contexto operativo adverso.

As etapas de desenho de itens e de especificação do espaço do desfecho contemplam uma primeira visita à população-alvo para que os primeiros lotes de protótipos (i.e., versões alternativas e preliminares do instrumento) sejam submetidos a uma avaliação de aceitabilidade, compreensão e impacto emocional. Uma estratégia atraente é pré-testar o instrumento (quinta etapa). A partir das evidências encontradas no pré-teste são escolhidos os protótipos mais promissores, que deverão ser testados formalmente na fase seguinte. O Quadro 1 oferece algumas informações adicionais, bem como sugere diversas referências para consulta.

AVALIAÇÃO DE ADEQUAÇÃO DE ESTRUTURA INTERNA

Já anunciada na Figura , esta etapa da segunda fase de desenvolvimento ou ATC de instrumentos é aprofundada no Quadro 2 , no qual são apresentados: as estruturas a serem avaliadas (configural, métrica e escalar); as respectivas propriedades em avaliação e as principais questões que demandam resposta; os modelos e as técnicas de análise envolvidos; além de comentários sobre o que se espera de cada propriedade visitada e como avaliá-la, inclusive quanto às demarcações que norteiam decisões.

Estrutura a avaliarPropriedade em avaliaçãoQuestões a serem respondidasModelo(s) /parâmetro(s)Comentários
ConfiguralDimensionalidade (conjeturada)A estrutura configural proposta na primeira fase (“prototípica”) emerge? Pode ser corroborada?ACP, AFE/MEEE, AFC. Autovalores, de forma preliminar, seguidos de número de fatores emergentes nas análises fatoriais.Espera-se que a dimensionalidade proposta nas fases anteriores seja corroborada; caso contrário, cabe explorar estruturas dimensionais alternativas. Na perspectiva de uma análise preliminar, com ACP, isso pode ser observado mediante o número de autovalores > 1,0 emergentes. Quando a razão entre o primeiro e o segundo autovalor é maior do que quatro, alguns autores sugerem a possibilidade de unidimensionalidade . Aprofundando com AFC, o número de dimensões é avaliada por meio de diagnóstico interino sugerindo má especificação configural (e.g., usando Índices de Modificação e Mudanças Esperadas de Parâmetros via Multiplicador de Lagrange ). No caso de análises por meio de MEEE , é possível observar diretamente estruturas alternativas para além da conjeturada teoricamente.
Pertinência teórica de itens (congruência teórico-empírica)A previsão de pertencimento dos itens em suas respectivas dimensões pode ser amparada nos resultados da análise?AFE/MEEE e/ou AFC. Posicionamento ou localização de itens em fatores.Os itens devem expressar seus respectivos fatores, distintos entre si, conforme pensado para o instrumento na sua concepção ou em um processo de ATC. Se algum item manifesta dimensões que não aquela prevista teoricamente, há necessidade de revisão.
Especificidade fatorialCada item se vincula a apenas uma dimensão ou não? Há ambiguidade de pertencimento?AFE/MEEE e/ou AFC. Carga cruzada de itens.Se um item encerra especificidade fatorial, espera-se que não haja ambiguidade quanto ao carregamento fatorial. Espera-se que o item seja uma expressão exclusiva do fator que supostamente representa. Itens que violam esta propriedade devem ser identificados e, dependendo da situação, modificados ou mesmo substituídos.
MétricaConfiabilidade/discriminância de itensQual é a magnitude da relação entre os itens e os fatores que pressupostamente os manifestam?AFE/MEEE e/ou AFC/TRI. Carga e resíduo dos itens.Para que o item seja considerado confiável, espera-se que sua carga fatorial esteja acima de uma demarcação pré-especificada. A literatura não estipula um valor único. Convencionalmente, 0,30 , 0,35 ou 0,40 são pontos de corte considerados aceitáveis para admitir um item como confiável. A confiabilidade também se atrela à noção de discriminabilidade, uma vez que cargas fatoriais têm relação direta com os parâmetros obtidos nos modelos TRI, expressando o quão discriminante um item é . Plotando-se curvas de diferentes (correspondentes aos λ ), é possível visualizar a discriminância via Curva Característica do Item e decidir.
Ausência de redundância (particularidade) de conteúdo de itens.Há itens cujo conteúdo se sobrepõe de tal sorte que não mapeiam o construto independentemente?MEEE, AFC/TRI. Correlação residual (implicando violação de independência condicional/local).Em princípio, espera-se que os itens de um determinado fator não mostrem correlações residuais. Espera-se que sejam independentes, uma vez condicionados ao fator que pressupostamente manifestam. Violação de independência implica assumir que as variabilidades dos itens envolvidos se devam a outro motivo comum, para além do tal fator que representam. A magnitude de uma correlação residual a partir da qual se pode endossar uma violação de independência condicional é um tanto arbitrária. Uma possibilidade é demarcar um valor (ou patamar) pré-determinado teoricamente sustentável – por exemplo, 0,20 ou 0,25 – e contrastar estatisticamente os modelos com ou sem a correlação residual livremente estimada. Outra possibilidade é seguir recomendações de autores para orientar o processo de decisão. Reeve et al. sugerem a simples demarcação de ≥ 0,3 para admitir a existência de correlação residual. Há, também, demarcações baseadas em estatísticas formais. Uma é a estatística de dependência local baseada em qui-quadrado (LD χ ), proposta por Chen e Thissen , que usa o corte de ≥ 10 para indicar dependências. Outra é a estatística Q3 (e variantes) sugerida por Yen . Diversas situações levam à correlação entre resíduos (erros) de itens , mas uma comum em processos de desenvolvimento (ou ATC) de instrumentos se refere à presença de redundância de conteúdo (parcial) entre itens (em geral, pares). Avaliação teórica, observando semântica e significados denotativos e conotativos dos respectivos conteúdos, deve ser realizada quando uma violação estatística é observada.
Convergência fatorial (VFC).Os itens, em conjunto, refletem de modo convergente o fator correspondente?AFC. Variância média extraída.A VFC se refere a cada fator, como o próprio nome indica. Entende-se que há VFC se a relação entre a VME dos itens – i.e., a variância que os itens têm em comum – é, ao menos, maior do que a variância conjunta dos respectivos erros – que expressam a variabilidade dos itens, que não se deve aos fatores de interesse. Assim, quantitativamente, endossa-se a VFC se a VME ≥ 0,5 . Na perspectiva interpretativa, endossando-se VFC, aceita-se que a dimensão (fator) em tela está “bem servida” pelo respectivo conjunto de itens, uma vez que estes contêm mais informatividade relativa ao fator do que mero erro (seja amostral e/ou de aferição/processo e/ou inerente aos próprios itens componentes ). Um indicador conexo – √ – resume a confiabilidade do construto (dimensão). Assim, valores ≥ 0,7 também indicam convergência e, estritamente, que há consistência interna (i.e., consistência dos/entre os itens, interna ao fator a que pertencem) .
Discriminância fatorial (VFD).A quantidade de informação captada pelo conjunto de itens em seus respectivos fatores é maior do que aquela compartilhada entre os fatores componentes (discriminante)?AFC. Contraste da variância média extraída (pelos itens) de um determinado fator com o quadrado das correlações deste fator com os outros do sistema.Esta propriedade somente se aplica a construtos multidimensionais. Se há VFD, espera-se que haja mais informação “fluindo” dos fatores para os itens do que entre os próprios fatores. A demarcação de violação de VFD pode seguir alguma regra genérica ( ) ou uma avaliação mais formal. Em relação à primeira, alguns autores sugerem correlações fatoriais de 0,80 a < 0,85 como suspeitos e ≥ 0,85 como indicativos de violação . Uma estratégia mais rigorosa consiste em testar formalmente a significância estatística da diferença entre a VME do fator e o quadrado das correlações deste fator com os demais . Um sinal positivo e estatisticamente significativo dessa diferença endossaria a VFD, ao passo que um sinal negativo estatisticamente significativo favoreceria sua rejeição, isto é, de que há uma violação. Uma diferença não significativa positiva ou negativa pode ser uma indicação a favor ou contra uma violação. Adotando-se uma posição mais conservadora, a decisão por uma violação poderia se basear somente em uma diferença estatisticamente significativa.
EscalarCobertura de informação do traço latente (referente a cada item e para o conjunto de itens).O conjunto de itens cobre a maior parte do traço latente ou há regiões “desmapeadas”? Nas regiões do traço latente efetivamente mapeadas, os itens se distribuem de maneira uniforme ou há aglomerados indicando redundância de posicionamento?TRI paramétrica. , usando o Wright Map, que consiste em combinar o mapa do construto com estimativas de posicionamento dos itens obtidas nas análises TRI e observação de gráficos.Espera-se que os itens selecionados como manifestos do traço latente sejam capazes de adequadamente posicionar indivíduos (ou qualquer outra unidade de análise) ao longo do mapa do construto. Não menos importante, o espectro de variação previsto pelo mapa do construto também deve ser coberto de modo apropriado. Uma forma de avaliar esses dois aspectos é apreciar criticamente a disposição dos itens segundo o Wright Map proposto . Nesse sentido, aprecia-se a correspondência do posicionamento desses itens ao longo do espectro latente – por exemplo, via parâmetros b obtidos nas análises TRI – e o crescente de intensidade apresentado no mapa do construto . Esse procedimento de deve ser seguido de uma análise da cobertura de informação . Gráficos específicos permitem indicar se o conjunto de itens cobre a maior parte do traço latente ou se há regiões com lacunas, “despovoadas” de itens. Esses gráficos ajudam igualmente a detectar se todas as regiões do traço latente são efetivamente abrangidas, se os itens se distribuem de maneira razoavelmente uniforme ou se há aglomerados, indicando sobreposição e, assim, redundância de posicionamento/mapeamento. Avaliações gráficas adicionais permitem, de forma complementar, apreciar o comportamento dos itens, principalmente quanto às regiões de cobertura do traço latente. Obtidos via TRI paramétrica, esses gráficos incluem as Funções de Informações dos Itens e as Curvas da Característica do Item. No caso de os itens serem policótomos, obtêm-se as Curvas da Característica de Categorias. Essas servem também para avaliar “internamente” os itens, observando-se as áreas de cobertura de cada nível e se estas estão ordenadas conforme o pressuposto teórico encerrado no mapa do construto. Exemplos destes gráficos se encontram nas referências de fundamentação indicadas ao final deste Quadro ou em buscas na internet (e.g., https://www.stata.com/manuals/irt.pdf).
Ordenamento com escalabilidade e monotonicidade de itens.Os itens que estão mapeando regiões do mapa do construto fazem-no na ordem de intensidade teoricamente esperada ou existem regiões do construto onde itens menos severos (mais leves/brandos) suplantam outros itens que, em princípio, deveriam capturar áreas mais intensas desse traço latente?TRI não paramétrica e paramétrica. H de Loevinger, Critério de Mokken e avaliações gráficas.Espera-se que os itens consigam separar bem as regiões do traço latente (conteúdo) que devem supostamente cobrir, evitando ao máximo a ocorrência de sobreposição. Duas estratégias permitem checar isso: ordenamento com escalabilidade e monotonicidade. Ordenamento de itens com escalabilidade se refere à coerência entre as frequências com que os itens são endossados e a porção do mapa do construto que deveriam teoricamente cobrir. Em um cenário ideal, espera-se que um respondente com baixa intensidade de um determinado traço latente do construto (dimensão) efetivamente endosse um item representante (mapeador) dessa região de “menor” intensidade, ao mesmo tempo que negative outro item manifestando um grau mais intenso do construto. A análise desse aspecto pode ser realizada tanto para cada item quanto para todo o conjunto do instrumento. O coeficiente de escalabilidade H de Loevinger reflete isso . Tendo o valor 1,0 como limite superior de adequação, recomenda-se uma estimativa para o conjunto de itens de, pelo menos, 0,3 . Um H abaixo desse valor qualifica o instrumento como de má escalabilidade. Segundo Mokken , valores de 0,3 a < 0,4 indicam escalabilidade fraca; de 0,4 a < 0,5, média; e ≥ 0,5, forte. Em um instrumento aceitável, espera-se igualmente que a maioria das estimativas H de cada item também siga essas referências. O pressuposto de monotonicidade é outra propriedade conexa a ser apreciada durante a avaliação do comportamento escalar dos itens e, por extensão, do conjunto formado por estes . A monotonicidade pode ser subscrita quando a probabilidade de confirmação (positivação) de um item aumenta de maneira correspondente ao aumento de intensidade do traço latente. Visualmente, há violação de monotonicidade simples quando há declínio(s) de probabilidade de endosso à medida que cresce o escore total (latente). Adicionalmente, entende-se como violação de monotonicidade dupla se há algum cruzamento ao longo das curvas dos itens obtidas numa análise TRI. Seja simples ou dupla, considera-se que a monotonicidade está presente quando o critério sugerido por Mokken for < 40 , entendendo que alguns cruzamentos podem ser atribuídos à variabilidade amostral. Valores entre 40 e 80 servem de alerta, suscitando uma avaliação mais detalhada dos pesquisadores; um critério > 80 levanta dúvidas sobre a hipótese de monotonicidade de um item, bem como da escala como um todo .

a Legenda: ACP – análise de componentes principais; AFC – análise fatorial confirmatória; AFE – análise fatorial exploratória; MEEE – modelos de equação estrutural exploratória; TRI – modelos de teoria de resposta ao item; VFC – validade fatorial convergente; VFD – validade fatorial discriminante; VME – variância média extraída.

b Referências de fundamentação: Gorsuch 67 Rummel 68 , Brown 17 , Kline 19 , Marsh et al. 48 , Embretson e Reise 62 , Bond e Fox 27 , De Boeck e Wilson 69 , Van der Linden 21 , Davidov et al. 30 Algumas destas referências são também assinaladas ocasionalmente, quando necessário, juntamente com outras específicas.

O Quadro 2 evidencia quantas propriedades necessitam ser escrutinadas antes que se possa julgar a estrutura interna como adequada e, assim, endossar este componente de validade do instrumento 15 , 16 . É um panorama que em muito contrasta com o que a literatura habitualmente oferece, em que a validade de um instrumento tende a ser satisfeita por evidências um tanto quanto escassas e frágeis. Com efeito, não raramente decisões sobre a aceitabilidade de uma escala se escoram em algumas poucas análises fatoriais usando apenas índices de ajuste de modelo, demarcados por pontos de cortes genéricos (e.g., Root Mean Square Error of Approximation/ RMSEA, Comparative Fit Index/ CFI, Tucker-Lewis Index/ TLI 17 ) e carecendo de exames mais aprofundados sobre os itens e a(s) escala(s) como um todo. A rigor, a gama de propriedades arroladas no Quadro 2 não cabe em produtos únicos (e.g., artigos científicos), sendo necessários, para tal, estudos seriais visitando um ou mais aspectos por vez. Os meandros metodológicos relativos a cada propriedade a ser coberta certamente exigem detalhamento e maior espaço editorial.

Um ponto já abordado ilustra esse rigor fundamental: a necessidade de demarcações explícitas para se decidir se um item ou escala atende à propriedade em escrutínio. Todos os estimadores utilizados nas avaliações requerem delimitação de pontos de corte, de tal sorte que escolhas possam ser replicadas ou, se for o caso, criticadas, rejeitadas ou alteradas no decurso do desenvolvimento ou da ATC de um instrumento. O Quadro 2 procura oferecer alguns marcos indicados na literatura afim. Mais do que parâmetros de referência prescritivos, estes devem nos servir de estímulo ao exame empírico do instrumento. O ponto principal é que as muitas decisões a tomar rumo à adequabilidade psicométrica de um instrumento precisam de âncoras claras e previamente acordadas com os pares da comunidade científica. A literatura, por certo, se enriqueceria se estes pormenores se estendessem aos artigos de divulgação científica.

Uma questão a salientar é que, no contexto processual em tela, as análises multivariadas têm utilidade primordial como dispositivos diagnósticos. Sendo ferramentas de processo, devem atender às perguntas centrais postuladas a priori . Nesse sentido, é preciso distinguir as questões eminentemente qualitativas das quantitativas que envolvem estritamente a esfera técnico-metodológica. A terceira propriedade configural apresentada no Quadro 2 serve de exemplo. Mais do que simplesmente verificar se uma análise fatorial exploratória mostra uma carga cruzada, importa responder se efetivamente há violação de especificidade fatorial, o que seria antitético ao projetado na primeira fase, quando da elaboração do protótipo. A presença de uma carga cruzada de larga monta sugere ambiguidade no item em tela, que não seria exclusivo ao fator pressuposto e que, portanto, sua função como um “representante empírico” do mapa do construto não seria satisfeita. Aqui, a evidência quantitativa atende à qualitativa, sinalizando que há problema e necessidade de ação, seja modificando a semântica do item, seja substituindo-o por outro de melhor propriedade. Em nada diferentes, as demais propriedades demandam o mesmo olhar.

Para além das propriedades internas de itens e escalas sintetizadas no Quadro 2 , duas outras questões relacionadas merecem alusão por sua recorrência. A primeira diz respeito à presunção de invariância de medida (configural, métrica e escalar) 17 . A suposição de que o desempenho de um instrumento não varia em domínios populacionais diferentes é quase uma regra. No mais das vezes, assume-se tacitamente que o funcionamento do instrumento é consistente entre os diversos grupos populacionais investigados (e.g., gêneros, faixas etárias, escolaridades, estratos geográficos), de modo que as diferenças encontradas entre eles são tomadas como factuais e não decorrentes de problemas de mensuração. No entanto, esta é uma posição difícil de sustentar sem maiores evidências, uma vez que o funcionamento inconsistente de um instrumento em subgrupos populacionais pode conduzir a inferências espúrias e, no limite, a decisões e ações sanitárias ineficientes ou até mesmo danosas 20 . Há de se ter cuidado nessa direção, levando os programas de investigação sobre instrumentos de aferição um passo adiante. Não cabe apenas escrutinar suas propriedades, mas avaliá-las em diversos segmentos populacionais. Garantir invariância do instrumento em diferentes grupos populacionais é permitir comparações fidedignas.

Adjacente à invariância está a questão da equalização e ligação ( linking ) de instrumentos 22 . Trata-se da busca de métricas comuns a instrumentos que supostamente captam o mesmo construto, mas que possuem itens distintos e/ou com opções de resposta variadas 25 , 26 . Em ambas as situações há de se ter cuidado ao oferecer sínteses. Resultados de estudos podem não ser comparáveis se, mesmo focados em um mesmo construto, são realizados em domínios populacionais diferentes e com instrumentos distintos. Sem equalização, ferramentas de aferição podem carecer de sintonia métrica e escalar.

Também ligada às propriedades escalares de um instrumento está a adequação de agrupamentos quando se aplicam pontos de corte a um escore (seja bruto, formado pelo somatório dos escores dos itens componentes, seja baseado em modelos, como o são os escores fatoriais ou Rasch 27 , 28 ). Esse ponto merece atenção, especialmente no que diz respeito às abordagens frequentemente utilizadas em epidemiologia. Não é incomum categorizar um escore em um ou poucos grupos, muitas vezes o inflexionando na média, na mediana ou em algum outro ponto “estatisticamente interessante”. Esse procedimento, no entanto, não é desprovido de riscos, uma vez que a população de estudo não necessariamente é particionada em grupos homogêneos internamente e heterogêneos entre si. O conhecimento de especialistas sobre o objeto é certamente fundamental no processo de se especificar agrupamentos adequados, mas a busca da semelhança interna de grupos com distinção comparativa pode ser mais bem servida utilizando-se adicionalmente abordagens baseadas em modelos, tais como análises de classes latentes ou modelos de mistura finita 29 .

AVALIAÇÃO DA CONEXÃO CONSTRUTO-TEORIA

O Quadro 3 propõe uma tipologia, na linha do que seria a validade por teste de hipótese apresentada no início da década de 2010 pela iniciativa COSMIN ( COnsensus-based Standards for the selection of health Measurement INstruments ) 15 , 16 , 33 . Ao contrário da concisão aparente da tipologia, esta etapa da segunda fase de avaliação de um instrumento implica, de fato, um longo processo, talvez tão longo quanto caberia estudar o próprio construto em tela, em todas as suas relações de causas e efeitos. Revisitando outros textos 7 , 11 , valeria lembrar que determinar a validade de um instrumento corresponde, em última instância, ao estabelecimento da própria validade da teoria da qual faz parte o construto que o instrumento se propõe a medir. Um tanto circular e algo desalentador devido ao longo caminho que projeta, este raciocínio, por sua vez, nos alerta para o quão arriscado e imprudente é restringir o endosso e a aprovação de um instrumento a algumas poucas investidas de pesquisa. A solidificação e, por fim, o aval de adequabilidade de um instrumento requerem muitas testagens, seja no âmbito interno ao instrumento, seja de suas conexões externas.

Etapa de avaliaçãoQuestões a serem respondidasTécnica/método/modelo Comentários
Avaliação de relações entre (sub)escalas do instrumento.As (sub)escalas que constituem o instrumento se associam na direção e magnitude esperadas?Testes de associação paramétricos ou não paramétricos entre as (sub)escalas que constituem o instrumento.Esse aspecto já poderia ter sido preliminarmente contemplado na avaliação de validade discriminante envolvendo correlação fatorial, na etapa de avaliação da estrutura interna. Nesse momento de análise, porém, os testes já são baseados nos próprios escores das escalas (sejam brutos ou estimados em modelos), refinados em etapas anteriores, principalmente quanto à estrutura escalar.
Avaliação de relações entre as (sub)escalas com outros instrumentos do mesmo construto que não sejam considerados de referênciaO instrumento se associa com outro que afere o mesmo construto de forma semelhante (convergente)? Com que magnitude?Comparação de grupos extremos e testes de associação paramétricos ou não paramétricos.Esta etapa diz respeito à validade de construto. Em conjunto, a validade de constructo, de conteúdo e de critério são conhecidas como os três Cs descritos em muitos livros-texto no âmbito da teoria clássica de medida.
Avaliação de relações entre as (sub)escalas com outro instrumento (ou procedimento) considerado de referência para o próprio construto.O instrumento é capaz de medir o que se propõe quando há outro de referência?Estimativa de sensibilidade, especificidade e área abaixo da curva ROC ( ) do instrumento, tendo como referência um critério concorrente (instrumento de referência) e/ou um desfecho futuro a ser predito.A literatura tradicionalmente denomina esta etapa como validade de critério (um dos três Cs), subdividida em validade concorrente e validade preditiva.
Avaliação de relações entre a (sub)escala com outras que não sejam do construto em tela.O instrumento confirma as predições e hipóteses gerais da teoria que o envolve, i.e., da sua rede nomológica? O instrumento deixa de se relacionar com outros construtos que não fazem parte da teoria geral que abrange o fenômeno de interesse?Análises multivariadas de dados, modelos causais complexos e outras técnicas estatísticas que permitam avaliar as relações de interesse com maior rigor e precisão.Avaliação de relações entre a (sub)escala com outras que não sejam do construto em tela.

a Referências de fundamentação: Streiner et al. 7 , Bastos et al. 3 , Reichenheim e Moraes 6 , Lissitz 70 , Armitage et al. 71 , Corder e Foreman 72 , Kline 19 , Little 61 , Hernán e Robins 5 , VanderWeele 35 .

Nessa direção, conforme sugere o Quadro 3 , validar externamente um instrumento vai de simples testes de associação entre as subescalas componentes até testes de intrincadas hipóteses sobre o construto e que a literatura entende como a rede nomológica das predições interligadas de uma teoria 5 , 7 , 34 , 35 . Seja qual for o nível de complexidade da investida externa, uma pergunta que se impõe – e que frequentemente surge no âmbito das publicações científicas – é quando um estudo de validade externa de um instrumento deve ser executado, dadas as etapas a serem antes superadas para melhor conhecer seus meandros. Vale investir em projetos de pesquisa na linha do que o Quadro 3 indica sem antes ter uma mínima evidência sobre a sustentabilidade das estruturas configural, métricas e escalares do instrumento? É preciso reconhecer que correlações entre escalas (e.g., do instrumento em tela e de outras que cubram o mesmo construto) podem perfeitamente se materializar, mesmo diante de múltiplas insuficiências psicométricas de âmbito interno. O que significariam estas correlações, sabendo-se, por exemplo, que o conjunto de itens não atende satisfatoriamente aos requisitos de especificidade fatorial, validade fatorial convergente e escalabilidade? A resposta baseada na mera correlação indicaria validade externa, mas restaria perguntar “de quê?”, se a capacidade de representação do construto é falha e pouco informativa. Não há resposta clara a essas questões, mas é preciso levantá-las antes de se proceder “cegamente” a estudos de validade externa. O timing dessas etapas é evidentemente da alçada de cada programa de investigação, mas o ditado “quem tem pressa, come cru” serve de lembrete: pouco tempo e esforço (e recursos!) investidos em uma etapa pode ser tempo e esforço (e recursos!) dobrados em outra posterior.

CONSIDERAÇÕES FINAIS

Com a leitura do presente artigo, deve ficar claro que o desenvolvimento de um instrumento de aferição envolve um processo extenso, compreendendo múltiplos estudos concatenados. Há de se notar que a trajetória pode ser ainda mais longa e tortuosa em se considerando os estudos de replicação ou quando certos estudos psicométricos suscitam questões que requerem respostas fundamentais que só a retomada da fase prototípica do desenvolvimento pode oferecer. Esse panorama contrasta sobremaneira com a forma como os investigadores em epidemiologia costumam abordar seus instrumentos de aferição. Como visto, ao contrário do que muitos supõem, evidências sobre a adequação de uma ferramenta de medida não se esgotam em um ou dois estudos sobre sua constituição dimensional, acompanhados, quiçá, da magnitude das cargas fatoriais encontradas. Esse alerta se estende também a acríticas análises de validade externa que, conforme mencionado na seção antecedente, requerem que a constituição interna do instrumento esteja minimamente cuidada.

E há também o desenvolvimento e o refino de versões para que as pesquisas realizadas em populações socioculturalmente distintas guardem comparabilidade e possam dialogar entre si. O processo de ATC não é menos intrincado do que o de um instrumento novo. Todas as fases e etapas se aplicam igualmente aqui. Aliás, um(a) pesquisador(a) realizando uma ATC frequentemente se depara com variadas lacunas no próprio programa de investigação original do instrumento. Por vezes, há problemas na execução dos estudos disponíveis; outras (muitas) vezes, diversas propriedades sequer foram estudadas. Nesse momento, o foco passa das equivalências ( cf. seção sobre Cenários de Pesquisa) para o cerne da própria estrutura do instrumento. Isso não é trivial, pois haverá sempre a ambivalência entre se tratar de um problema intrínseco da ferramenta e ser um problema no processo de ATC 6 , 11 . Seja como for, examinar um instrumento em outro contexto sociocultural demanda ainda mais tempo e esforços. Não é para menos que muitos entendem as instâncias de ATC como mais uma etapa de validação de construto 33 .

Uma questão que se coloca frequentemente é se todas as etapas precisam ser cumpridas para tornar o instrumento apto à utilização em pesquisa ou aplicação nos serviços de saúde. Esta é uma pergunta difícil de responder, mas alguns marcos podem nos guiar. Um já foi aventado na seção sobre as fases do processo: ter uma fase prototípica bem planejada e desenvolvida ajuda sobremaneira a obter resultados favoráveis na segunda grande fase do processo. Profundidade na primeira fase não somente contribui para se chegar a melhores propriedades psicométricas, mas também agrega eficiência, na medida em que vários problemas tendem a ser mitigados ou mesmo evitados precocemente. Cumpre lembrar que os estudos epidemiológicos na fase psicométrica, a rigor, costumam ser de grande porte e, logo, são raramente passíveis de replicações com vistas à resolução de anomalias emergentes.

Outro norte é recorrer aos fundamentos, lembrando sempre a essência de cada propriedade e o que significa sua violação. Por exemplo, sentiríamos firmeza em declarar um instrumento como válido e pronto para uso à luz de umas poucas análises fatoriais exploratórias – afirmando preliminarmente uma estrutura configural – e/ou alguns estudos correlacionando o(s) escore(s) da(s) (sub)escala(s) em teste com certas variáveis sociodemográficas – que ofereçam uma primeira evidência sobre a pertinência teórica? Dada a gama de possibilidades substantivas e processuais que visitamos, seria isto suficiente ou deveríamos adiar a utilização do instrumento e obter adicionais e diversas provas para apoiar sua validade? Reiteramos que não há resposta rápida e pronta, mas que, talvez, uma máxima possa nos ser útil à tomada de decisão: ainda que não estejamos preparados a deixar o ótimo atrapalhar o bom, ou mesmo deixar o bom atrapalhar o razoável, pode ser que valha a pena deixar o razoável atrapalhar o ruim. Embora seja uma perspectiva subjetiva – sempre a ser negociada entre pares –, se colocada em prática, possivelmente nos conduzirá a melhores instrumentos e, conforme já apontamos, a melhores resultados e comparações entre estudos ou ações de saúde.

O contínuo desenvolvimento, refinamento e adaptação de instrumentos de aferição deve ser visto como parte fundamental e integrada à pesquisa epidemiológica. A construção do conhecimento requer instrumental em patamares aceitáveis de validade e confiabilidade, à altura dos rigores comumente exigidos, por exemplo, na elaboração de desenhos de estudos e suas complexas análises. De nada adiantam meticulosidades e aprimoramentos nessas esferas se o diálogo entre as publicações e a apreciação de consistência das evidências científicas acabam falhando por conta da precariedade dos instrumentos utilizados. Sendo também produtos voltados ao uso coletivo, instrumentos de aferição demandam processos de desenvolvimento que pouco diferem dos encontrados para medicamentos ou outras tecnologias de saúde. E, como tal, merecem zelo e dedicação.

Financiamento: MER foi parcialmente apoiado pelo Conselho Nacional de Desenvolvimento Científico e Tecnológico do Brasil (CNPq - Processo 301381/2017-8). JLB foi parcialmente financiado pelo Conselho Nacional de Desenvolvimento Científico e Tecnológico do Brasil (CNPq - Processo 304503/2018-5).

Measurement Issues in Quantitative Research

  • Reference work entry
  • First Online: 13 January 2019

  • Dafna Merom & James Rufus John

Measurement is central to empirical research, whether observational or experimental. Common to all measurement is the systematic assignment of a numerical value (scale) to a variable or factor we wish to quantify. Measurement can be applied to physical, biological, or chemical attributes, or to more complex factors such as human behaviors, attitudes, and physical, social, or psychological characteristics, or to a combination of several characteristics that denote a concept. There are many reasons for measuring that are relevant to the health and social science disciplines: understanding the aetiology of disease or developmental processes, evaluating programs, monitoring progress, and informing decision-making. Regardless of the specific purpose, we should aspire for our measurements to be adequate. In this chapter, we review the properties that determine the adequacy of a measurement (reliability, validity, and sensitivity) and provide examples of the statistical methods used to quantify these properties. In the concluding section, we draw examples from the physical activity and public health field in the four areas for which precise measurement is necessary, illustrating how imprecise or biased scoring procedures can lead to erroneous decisions across the four major purposes of measurement.
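The chapter's statistical detail is not reproduced here, so the following is only a minimal sketch, under our own assumptions, of two reliability indices of the kind the abstract alludes to: Cronbach's alpha for internal consistency and the intraclass correlation coefficient ICC(2,1) often used for test-retest agreement. The data are simulated, and all names (cronbach_alpha, icc_2_1) are hypothetical illustrations rather than anything taken from the chapter itself.

```python
import numpy as np


def cronbach_alpha(items):
    """Internal-consistency reliability of a multi-item scale.

    items: array of shape (n_respondents, n_items).
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the summed score
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)


def icc_2_1(scores):
    """Test-retest reliability as ICC(2,1): two-way random effects,
    absolute agreement, single measurement (Shrout & Fleiss convention).

    scores: array of shape (n_subjects, n_occasions).
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    # Two-way ANOVA decomposition: subjects (rows) x occasions (columns)
    ss_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    true_score = rng.normal(50, 10, size=200)                    # hypothetical latent trait
    items = true_score[:, None] + rng.normal(0, 5, (200, 5))     # five noisy items
    retest = np.column_stack(
        [true_score + rng.normal(0, 5, 200) for _ in range(2)]   # two measurement occasions
    )
    print(f"Cronbach's alpha: {cronbach_alpha(items):.2f}")
    print(f"ICC(2,1) test-retest: {icc_2_1(retest):.2f}")
```

With the noise levels assumed above, both indices come out high, illustrating the general point that reliability reflects the ratio of true-score variance to total observed variance; increasing the simulated error variance lowers both coefficients.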


Bolarinwa OA. Principles and methods of validity and reliability testing of questionnaires used in social and health science researches. Niger Postgrad Med J. 2015;22(4):195.

Article   Google Scholar  

Bowling A, Ebrahim S. Key issues in the statistical analysis of quantitative data in research on health and health services. In: Handbook of health research methods: investigation, measurement and analysis. England: Open University Press McGraw Hill Education Birshire; 2005. p. 497–514.

Google Scholar  

Brink H. Validity and reliability in qualitative research. Curationis. 1993;16(2):35–8.

Brown WJ, Trost SG, Bauman A, Mummery K, Owen N. Test-retest reliability of four physical activity measures used in population. J Sci Med Sport. 2004;7(2):205–15.

Brownson RC, Jones DA, Pratt M, Blanton C, Heath GW. Measuring physical activity with the behavioral risk factor surveillance system. Med Sci Sports Exerc. 2000;32(11):1913–8.

Busija L, Pausenberger E, Haines TP, Haymes S, Buchbinder R, Osborne RH. Adult measures of general health and health-related quality of life: Medical Outcomes Study Short Form 36-Item (SF-36) and Short Form 12-Item (SF-12) Health Surveys, Nottingham Health Profile (NHP), Sickness Impact Profile (SIP), Medical Outcomes study Short Form 36-Item (SF-36) and Short Form 12-Item (SF-12) Health Surveys, Nottingham Health Profile (NHP), Sickness Impact Profile (SIP), Medical Outcomes Study Short Form 6D (SF-6D), Health Utilities Index Mark 3 (HUI3), Quality of Well-Being Scale (QWB), and Assessment of Quality of Life (AQoL). Arthritis Care and Research. 2011;63(Supll S11):S383–S4121.

Cerin E, Saelens BE, Sallis JF, Frank LD. Neighborhood environment walkability scale: validity and development of a short form. Med Sci Sports Exerc. 2006;38(9):1682–91.

Davis RE, Couper MP, Janz NK, Caldwell CH, Resnicow K. Interviewer effects in public health surveys. Health Educ Res. 2009;25(1):14–26.

De Bruin A, Diederiks J, De Witte L, Stevens F, Philipsen H. Assessing the responsiveness of a functional status measure: the Sickness Impact Profile versus the SIP68. J Clin Epidemiol. 1997;50(5):529–40.

Delgado-Rodríguez M, Llorca J. Bias. J Epidemiol Community Health. 2004;58(8):635–41.

Deyo RA, Centor RM. Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J Chronic Dis. 1986;39(11):897–906.

Deyo RA, Diehr P, Patrick DL. Reproducibility and responsiveness of health status measures statistics and strategies for evaluation. Control Clin Trials. 1991;12((4):S142–58.

Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ. 2003;37:830–7.

Fok CCT, Henry D. Increasing the sensitivity of measures to change. Prev Sci. 2015;16(7):978–86.

Gadotti I, Vieira E, Magee D. Importance and clarification of measurement properties in rehabilitation. Braz J Phys Ther. 2006;10(2):137–46.

Golafshani N. Understanding reliability and validity in qualitative research. Qual Rep. 2003;8(4):597–606.

Grant JS, Davis LL. Focus on quantitative methods: Selection and use of content experts for instrument development. Research in Nursing and Health. 1997;20:269–74.

Griffiths P, Rafferty AM. Outcome measures (Gerrish K, Lathlean J, Cormack D, editors), 7th ed. West Sussex, UK: Wiley Blackwell; 2014.

Harris T, Kerry SM, Limb ES, Victor CR, Iliffe S, Ussher M, … Cook DG. Effect of a primary care walking intervention with and without nurse support on physical activity levels in 45- to 75-year-olds: the Pedometer And Consultation Evaluation (PACE-UP) cluster randomised clinical trial. PLoS Med. 2016;14(1):e1002210. https://doi.org/10.1371/journal.pmed.1002210 .

Heale R, Twycross A. Validity and reliability in quantitative studies. Evid Based Nurs. 2015. https://doi.org/10.1136/eb-2015-102129 .

Husted JA, Cook RJ, Farewell VT, Gladman DD. Methods for assessing responsiveness: a critical review and recommendations. J Clin Epidemiol. 2000;53(5):459–68.

Kimberlin CL, Winetrstein AG. Validity and reliability of measurement instruments used in research. Am J Health Syst Pharm. 2008;65(23):2276.

Last MJ. A dictionary of epidemiology. 4th ed. New York: Oxford University Press; 2001.

Leung L. Validity, reliability, and generalizability in qualitative research. J Fam Med Prim Care. 2015;4(3):324.

Manoj S, Lingyak P. Measurement and evaluation for health educators. Burlington: Jones & Bartlett Learning; 2014.

Merom D, Korycinski R. Measurement of walking. In: Mulley C, Gebel K, Ding D, editors. Walking, vol. 11–39. West Yorkshire, UK: Emerald Publishing; 2017.

Chapter   Google Scholar  

Merom D, Rissel C, Phongsavan P, Smith BJ, van Kemenade C, Brown W, Bauman A. Promoting walking with pedometers in the community. The step-by-step trial. Am J Prev Med. 2007;32(4):290–7.

Merom D, Bowles H, Bauman A. Measuring walking for physical activity surveillance – the effect of prompts and respondents’ interpretation of walking in a leisure time survey. J Phys Act Health. 2009;6:S81–8.

Nunan D. Research methods in language learning. Cambridge: Cambridge University Press; 1992.

Pannucci CJ, Wilkins EG. Identifying and avoiding bias in research. Plast Reconstr Surg. 2010;126(2):619.

Revicki D, Hays RD, Cella D, Sloan J. Recommended methods for determining responsiveness and minimally important differences for patient-reported outcomes. J Clin Epidemiol. 2008;61(2):102–9.

Schmidt S, Bullinger M. Current issues in cross-cultural quality of life instrument development. Arch Phys Med Rehabil. 2003;84(Suppl 2):S29–34.

Stamatakis E, Ekelund U, Wareham NJ. Temporal trends in physical activity in England: the Health Survey for England 1991 to 2004. Prev Med. 2007;45:416–23.

Streiner D, Norman G. Health measurement scales: a practical guide to their development and use. Oxford: Oxford University Press; 2003.

Terwee C, Dekker F, Wiersinga W, Prummel M, Bossuyt P. On assessing responsiveness of health-related quality of life instruments: guidelines for instrument evaluation. Qual Life Res. 2003;12(4):349–62.

Thorndike RM. Measurement and evaluation in psychology and education. 7th ed. Upper Saddle River: Pearson Prentice Hall; 2007.

Ursachi G, Horodnic IA, Zait A. How reliable are measurement scales? External factors with indirect influence on reliability estimators. Procedia Economics and Finance. 2015;20:679–86.

Walters SJ. Quality of life outcomes in clinical trials and health-care evaluation: a practical guide to analysis and interpretation, vol. 84. West Yorkshire, UK: Wiley; 2009.

Winzenberg T, Shaw KS. Screening for physical activity in general practice a test of diagnostic criteria. Aust Fam Physician. 2011;40(1):57–61.

Yu S, Yarnell JW, Sweetnam PM, Murray L. What level of physical activity protects against premature cardiovascular death? The Caerphilly study. Heart. 2003;89(5):502–6.

Download references

Author information

Authors and affiliations.

School of Science and Health, Western Sydney University, Penrith, Sydeny, NSW, Australia

Dafna Merom

Translational Health Research Institute, School of Medicine, Western Sydney University, Penrith, NSW, Australia

Dafna Merom & James Rufus John

Capital Markets Cooperative Research Centre, Sydney, NSW, Australia

James Rufus John

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Dafna Merom .

Editor information

Editors and affiliations.

School of Science and Health, Western Sydney University, Penrith, NSW, Australia

Pranee Liamputtong

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this entry

Cite this entry.

Merom, D., John, J.R. (2019). Measurement Issues in Quantitative Research. In: Liamputtong, P. (eds) Handbook of Research Methods in Health Social Sciences. Springer, Singapore. https://doi.org/10.1007/978-981-10-5251-4_95


Chapter 5: Psychological Measurement

Reliability and Validity of Measurement

Learning Objectives

  • Define reliability, including the different types and how they are assessed.
  • Define validity, including the different types and how they are assessed.
  • Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.

Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not simply  assume  that their measures work. Instead, they collect data to demonstrate  that they work. If their research does not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.

Reliability

Reliability  refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (inter-rater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time.  Test-retest reliability  is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing Pearson’s r. Figure 5.2 shows the correlation between two sets of scores of several university students on the Rosenberg Self-Esteem Scale, administered twice, a week apart. Pearson’s r for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.

Figure 5.2 Scatterplot of scores at time 1 (x-axis) against scores at time 2 (y-axis), showing fairly consistent scores across the two administrations

Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.
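To make the computation concrete, here is a minimal Python sketch of a test-retest check. The score vectors are invented for illustration and are not the Rosenberg data plotted in Figure 5.2; the +.80 rule of thumb is the one described above.

```python
import numpy as np

# Hypothetical self-esteem scores for the same ten people, one week apart
time1 = np.array([22, 25, 18, 30, 27, 20, 24, 29, 21, 26])
time2 = np.array([23, 24, 19, 29, 28, 19, 25, 30, 20, 27])

# Pearson's r between the two administrations is the test-retest correlation
r = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest correlation: r = {r:.2f}")

# Common rule of thumb: r of +.80 or greater indicates good reliability
print("Good test-retest reliability" if r >= 0.80 else "Reliability may be a concern")
```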

Internal Consistency

A second kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioural and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a  split-half correlation . This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 5.3 shows the split-half correlation between several university students’ scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson’s  r  for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.

Figure 5.3 Scatterplot of scores on the even-numbered items (x-axis) against scores on the odd-numbered items (y-axis), showing fairly consistent scores

Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called  Cronbach’s α  (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.
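For readers who want to see these statistics in code, the short Python sketch below computes an odd/even split-half correlation and Cronbach’s α for an invented item-response matrix. It uses the standard variance-based formula, α = (k/(k − 1)) × (1 − Σ item variances / variance of total scores), rather than literally averaging every possible split-half correlation.

```python
import numpy as np

# Hypothetical responses: 6 people x 4 items, each item scored 1-5
items = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
])

# Split-half correlation: odd- vs. even-numbered items
odd_scores = items[:, 0::2].sum(axis=1)
even_scores = items[:, 1::2].sum(axis=1)
split_half_r = np.corrcoef(odd_scores, even_scores)[0, 1]

# Cronbach's alpha via the usual variance-based formula
k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Split-half correlation: {split_half_r:.2f}")
print(f"Cronbach's alpha: {alpha:.2f}")
```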

Interrater Reliability

Many behavioural measures involve significant judgment on the part of an observer or a rater.  Inter-rater reliability  is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring university students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does in fact have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other. Inter-rater reliability would also have been measured in Bandura’s Bobo doll study. In this case, the observers’ ratings of how many acts of aggression a particular child committed while playing with the Bobo doll should have been highly positively correlated. Interrater reliability is often assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.

Validity

Validity is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimetre longer than another’s would indicate nothing about which one had higher self-esteem.

Discussions of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence, in addition to reliability, that should be taken into account when judging the validity of a measure. Here we consider four basic kinds: face validity, content validity, criterion validity, and discriminant validity.

Face Validity

Face validity  is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behaviour, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory-2 (MMPI-2) measures many personality characteristics and disorders by having people decide whether each of over 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. For example, the items “I enjoy detective or mystery stories” and “The sight of blood doesn’t frighten me or make me sick” both measure the suppression of aggression. In this case, it is not the participants’ literal answers to these questions that are of interest, but rather whether the pattern of the participants’ responses to a series of questions matches those of individuals who tend to suppress their aggression.

Content Validity

Content validity  is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Criterion Validity

Criterion validity  is the extent to which people’s scores on a measure are correlated with other variables (known as  criteria ) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. When the criterion is measured at the same time as the construct, criterion validity is referred to as concurrent validity ; however, when the criterion is measured at some point in the future (after the construct has been measured), it is referred to as predictive validity (because scores on the measure have “predicted” a future outcome).

Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. This is known as convergent validity .

Assessing convergent validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982) [1] . In a series of studies, they showed that people’s scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (which represents a tendency toward obedience). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009) [2] .

Discriminant Validity

Discriminant validity , on the other hand, is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.

When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people’s scores were not correlated with certain other variables. For example, they found only a weak correlation between people’s need for cognition and a measure of their cognitive style—the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of “the big picture.” They also found no correlation between people’s need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.

Key Takeaways

  • Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
  • There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
  • Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
  • The reliability and validity of a measure is not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
  • Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute Pearson’s  r too if you know how.
  • Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability and criterion validity?
  • Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42, 116–131.
  • Petty, R. E., Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behaviour (pp. 318–329). New York, NY: Guilford Press.

Glossary

  • Reliability: The consistency of a measure.
  • Test-retest reliability: The consistency of a measure over time.
  • Test-retest correlation: The consistency of a measure on the same group of people at different times.
  • Internal consistency: Consistency of people’s responses across the items on a multiple-item measure.
  • Split-half correlation: Method of assessing internal consistency through splitting the items into two sets and examining the relationship between them.
  • Cronbach’s α: A statistic in which α is the mean of all possible split-half correlations for a set of items.
  • Inter-rater reliability: The extent to which different observers are consistent in their judgments.
  • Validity: The extent to which the scores from a measure represent the variable they are intended to.
  • Face validity: The extent to which a measurement method appears to measure the construct of interest.
  • Content validity: The extent to which a measure “covers” the construct of interest.
  • Criterion validity: The extent to which people’s scores on a measure are correlated with other variables that one would expect them to be correlated with.
  • Criteria: In reference to criterion validity, variables that one would expect to be correlated with the measure.
  • Concurrent validity: When the criterion is measured at the same time as the construct.
  • Predictive validity: When the criterion is measured at some point in the future (after the construct has been measured).
  • Convergent validity: When new measures positively correlate with existing measures of the same constructs.
  • Discriminant validity: The extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct.

Research Methods in Psychology - 2nd Canadian Edition Copyright © 2015 by Paul C. Price, Rajiv Jhangiani, & I-Chant A. Chiang is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Levels of Measurement | Nominal, Ordinal, Interval and Ratio

Published on July 16, 2020 by Pritha Bhandari . Revised on June 21, 2023.

Levels of measurement, also called scales of measurement, tell you how precisely variables are recorded. In scientific research, a variable is anything that can take on different values across your data set (e.g., height or test scores).

There are 4 levels of measurement:

  • Nominal : the data can only be categorized
  • Ordinal : the data can be categorized and ranked
  • Interval : the data can be categorized, ranked, and evenly spaced
  • Ratio : the data can be categorized, ranked, evenly spaced, and has a natural zero.

Depending on the level of measurement of the variable, what you can do to analyze your data may be limited. There is a hierarchy in the complexity and precision of the level of measurement, from low (nominal) to high (ratio).


Going from lowest to highest, the 4 levels of measurement are cumulative. This means that they each take on the properties of lower levels and add new properties.

Nominal level

You can categorize your data into mutually exclusive groups, but there is no order between the categories (for example, gender or ethnicity).

Ordinal level

You can categorize and rank your data in an order, but you cannot say anything about the intervals between the rankings. Although you can rank the top 5 Olympic medallists, this scale does not tell you how close or far apart they are in number of wins. Likert-type response options (e.g., very dissatisfied to very satisfied) are another ordinal example.

Interval level

You can categorize, rank, and infer equal intervals between neighboring data points, but there is no true zero point. The difference between any two adjacent temperatures is the same: one degree. But zero degrees is defined differently depending on the scale – it doesn’t mean an absolute absence of temperature. The same is true for test scores and personality inventories: a zero on a test is arbitrary; it does not mean that the test-taker has an absolute lack of the trait being measured.

Ratio level

You can categorize, rank, and infer equal intervals between neighboring data points, and there is a true zero point. A true zero means there is an absence of the variable of interest; in ratio scales, zero does mean an absolute lack of the variable. For example, in the Kelvin temperature scale, there are no negative degrees of temperature – zero means an absolute lack of thermal energy.
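A short numerical illustration of why ratios require a true zero: the Celsius values below are arbitrary examples, and the Kelvin conversion simply adds 273.15.

```python
# Interval scale (Celsius): zero is arbitrary, so ratios are not meaningful
celsius_a, celsius_b = 10.0, 20.0
print(celsius_b / celsius_a)           # 2.0, but 20 degrees C is not "twice as hot" as 10 degrees C

# Ratio scale (Kelvin): zero means a true absence of thermal energy
kelvin_a = celsius_a + 273.15          # 283.15 K
kelvin_b = celsius_b + 273.15          # 293.15 K
print(kelvin_b / kelvin_a)             # about 1.04, the physically meaningful ratio

# Differences, by contrast, are meaningful on both scales
print(celsius_b - celsius_a, kelvin_b - kelvin_a)   # 10.0 on each scale
```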


The level at which you measure a variable determines how you can analyze your data.

The different levels limit which descriptive statistics you can use to get an overall summary of your data, and which type of inferential statistics you can perform on your data to support or refute your hypothesis .

In many cases, your variables can be measured at different levels, so you have to choose the level of measurement you will use before data collection begins.

  • Ordinal level: You create brackets of income ranges: $0–$19,999, $20,000–$39,999, and $40,000–$59,999. You ask participants to select the bracket that represents their annual income. The brackets are coded with numbers from 1–3.
  • Ratio level: You collect data on the exact annual incomes of your participants.
Participant Income (ordinal level) Income (ratio level)
A Bracket 1 $12,550
B Bracket 2 $39,700
C Bracket 3 $40,300

At a ratio level, you can see that the difference between A and B’s incomes is far greater than the difference between B and C’s incomes.
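If exact (ratio-level) incomes are collected, they can always be recoded down into the ordinal brackets used above, but bracket codes can never be converted back into exact incomes. Below is a minimal pandas sketch of that one-way recoding, using the bracket boundaries from the example; the column and label names are illustrative only.

```python
import pandas as pd

# Exact annual incomes (ratio level) for the three participants in the example
df = pd.DataFrame({"participant": ["A", "B", "C"],
                   "income": [12550, 39700, 40300]})

# Recode into the ordinal brackets 1-3 described above
df["income_bracket"] = pd.cut(df["income"],
                              bins=[0, 19999, 39999, 59999],
                              labels=[1, 2, 3])

print(df)
# The reverse is impossible: exact incomes cannot be recovered from bracket codes alone.
```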

Descriptive statistics help you get an idea of the “middle” and “spread” of your data through measures of central tendency and variability .

When measuring the central tendency or variability of your data set, your level of measurement decides which methods you can use based on the mathematical operations that are appropriate for each level.

The methods you can apply are cumulative; at higher levels, you can apply all mathematical operations and measures used at lower levels.

Data type: mathematical operations; measures of central tendency; measures of variability

  • Nominal: equality (=, ≠); mode; none
  • Ordinal: equality plus comparison (>, <); mode, median; range, interquartile range
  • Interval: equality, comparison, addition, subtraction (+, −); mode, median, arithmetic mean; range, interquartile range, standard deviation, variance
  • Ratio: all of the above plus multiplication and division (×, ÷); mode, median, arithmetic mean; range, interquartile range, standard deviation, variance, relative standard deviation
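As a sketch of how this cumulative hierarchy plays out in analysis, the Python snippet below computes only the descriptive statistics that are appropriate at each level; the small data vectors are invented for illustration.

```python
import statistics

nominal = ["A", "B", "A", "C", "A"]          # categories only
ordinal = [1, 2, 2, 3, 1, 3, 2]              # ranked bracket codes
ratio   = [12550, 39700, 40300, 28000]       # exact incomes

# Nominal: only the mode is meaningful
print("Nominal mode:", statistics.mode(nominal))

# Ordinal: mode and median are meaningful; the mean is not,
# because the distances between ranks are unknown
print("Ordinal median:", statistics.median(ordinal))

# Interval/ratio: the mean and standard deviation (and, for ratio data, ratios) are meaningful
print("Ratio mean:", statistics.mean(ratio))
print("Ratio standard deviation:", round(statistics.stdev(ratio), 1))
```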



Some variables have fixed levels. For example, gender and ethnicity are always nominal level data because they cannot be ranked.

However, for other variables, you can choose the level of measurement . For example, income is a variable that can be recorded on an ordinal or a ratio scale:

  • At an ordinal level , you could create 5 income groupings and code the incomes that fall within them from 1–5.
  • At a ratio level , you would record exact numbers for income.

If you have a choice, the ratio level is always preferable because you can analyze data in more ways. The higher the level of measurement, the more precise your data is.

Cite this Scribbr article

Bhandari, P. (2023, June 21). Levels of Measurement | Nominal, Ordinal, Interval and Ratio. Scribbr. Retrieved August 5, 2024, from https://www.scribbr.com/statistics/levels-of-measurement/



5.2 Reliability and Validity of Measurement

Learning Objectives

  • Define reliability, including the different types and how they are assessed.
  • Define validity, including the different types and how they are assessed.
  • Describe the kinds of evidence that would be relevant to assessing the reliability and validity of a particular measure.

Again, measurement involves assigning scores to individuals so that they represent some characteristic of the individuals. But how do researchers know that the scores actually represent the characteristic, especially when it is a construct like intelligence, self-esteem, depression, or working memory capacity? The answer is that they conduct research using the measure to confirm that the scores make sense based on their understanding of the construct being measured. This is an extremely important point. Psychologists do not simply assume that their measures work. Instead, they collect data to demonstrate that they work. If their research does not demonstrate that a measure works, they stop using it.

As an informal example, imagine that you have been dieting for a month. Your clothes seem to be fitting more loosely, and several friends have asked if you have lost weight. If at this point your bathroom scale indicated that you had lost 10 pounds, this would make sense and you would continue to use the scale. But if it indicated that you had gained 10 pounds, you would rightly conclude that it was broken and either fix it or get rid of it. In evaluating a measurement method, psychologists consider two general dimensions: reliability and validity.

Reliability

Reliability refers to the consistency of a measure. Psychologists consider three types of consistency: over time (test-retest reliability), across items (internal consistency), and across different researchers (interrater reliability).

Test-Retest Reliability

When researchers measure a construct that they assume to be consistent across time, then the scores they obtain should also be consistent across time. Test-retest reliability is the extent to which this is actually the case. For example, intelligence is generally thought to be consistent across time. A person who is highly intelligent today will be highly intelligent next week. This means that any good measure of intelligence should produce roughly the same scores for this individual next week as it does today. Clearly, a measure that produces highly inconsistent scores over time cannot be a very good measure of a construct that is supposed to be consistent.

Assessing test-retest reliability requires using the measure on a group of people at one time, using it again on the same group of people at a later time, and then looking at the test-retest correlation between the two sets of scores. This is typically done by graphing the data in a scatterplot and computing Pearson’s r. Figure 5.3 “Test-Retest Correlation Between Two Sets of Scores of Several College Students on the Rosenberg Self-Esteem Scale, Given Two Times a Week Apart” shows the correlation between two sets of scores of several college students on the Rosenberg Self-Esteem Scale, given two times a week apart. Pearson’s r for these data is +.95. In general, a test-retest correlation of +.80 or greater is considered to indicate good reliability.

Figure 5.3 Test-Retest Correlation Between Two Sets of Scores of Several College Students on the Rosenberg Self-Esteem Scale, Given Two Times a Week Apart

Again, high test-retest correlations make sense when the construct being measured is assumed to be consistent over time, which is the case for intelligence, self-esteem, and the Big Five personality dimensions. But other constructs are not assumed to be stable over time. The very nature of mood, for example, is that it changes. So a measure of mood that produced a low test-retest correlation over a period of a month would not be a cause for concern.

Internal Consistency

A second kind of reliability is internal consistency, which is the consistency of people’s responses across the items on a multiple-item measure. In general, all the items on such measures are supposed to reflect the same underlying construct, so people’s scores on those items should be correlated with each other. On the Rosenberg Self-Esteem Scale, people who agree that they are a person of worth should tend to agree that they have a number of good qualities. If people’s responses to the different items are not correlated with each other, then it would no longer make sense to claim that they are all measuring the same underlying construct. This is as true for behavioral and physiological measures as for self-report measures. For example, people might make a series of bets in a simulated game of roulette as a measure of their level of risk seeking. This measure would be internally consistent to the extent that individual participants’ bets were consistently high or low across trials.

Like test-retest reliability, internal consistency can only be assessed by collecting and analyzing data. One approach is to look at a split-half correlation . This involves splitting the items into two sets, such as the first and second halves of the items or the even- and odd-numbered items. Then a score is computed for each set of items, and the relationship between the two sets of scores is examined. For example, Figure 5.4 “Split-Half Correlation Between Several College Students’ Scores on the Even-Numbered Items and Their Scores on the Odd-Numbered Items of the Rosenberg Self-Esteem Scale” shows the split-half correlation between several college students’ scores on the even-numbered items and their scores on the odd-numbered items of the Rosenberg Self-Esteem Scale. Pearson’s r for these data is +.88. A split-half correlation of +.80 or greater is generally considered good internal consistency.

Figure 5.4 Split-Half Correlation Between Several College Students’ Scores on the Even-Numbered Items and Their Scores on the Odd-Numbered Items of the Rosenberg Self-Esteem Scale

Perhaps the most common measure of internal consistency used by researchers in psychology is a statistic called Cronbach’s α (the Greek letter alpha). Conceptually, α is the mean of all possible split-half correlations for a set of items. For example, there are 252 ways to split a set of 10 items into two sets of five. Cronbach’s α would be the mean of the 252 split-half correlations. Note that this is not how α is actually computed, but it is a correct way of interpreting the meaning of this statistic. Again, a value of +.80 or greater is generally taken to indicate good internal consistency.

Interrater Reliability

Many behavioral measures involve significant judgment on the part of an observer or a rater. Interrater reliability is the extent to which different observers are consistent in their judgments. For example, if you were interested in measuring college students’ social skills, you could make video recordings of them as they interacted with another student whom they are meeting for the first time. Then you could have two or more observers watch the videos and rate each student’s level of social skills. To the extent that each participant does in fact have some level of social skills that can be detected by an attentive observer, different observers’ ratings should be highly correlated with each other. If they were not, then those ratings could not be an accurate representation of participants’ social skills. Interrater reliability is often assessed using Cronbach’s α when the judgments are quantitative or an analogous statistic called Cohen’s κ (the Greek letter kappa) when they are categorical.
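To show how κ corrects raw agreement for chance, here is a minimal Python sketch using invented categorical judgments from two hypothetical raters; the calculation follows the standard formula κ = (p_o − p_e) / (1 − p_e).

```python
from collections import Counter

# Hypothetical categorical judgments from two observers for ten participants
rater_a = ["aggressive", "calm", "calm", "aggressive", "calm",
           "aggressive", "calm", "calm", "aggressive", "calm"]
rater_b = ["aggressive", "calm", "aggressive", "aggressive", "calm",
           "aggressive", "calm", "calm", "calm", "calm"]

n = len(rater_a)

# Observed agreement: proportion of cases where the raters gave the same code
p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Expected agreement by chance, from each rater's marginal category proportions
counts_a = Counter(rater_a)
counts_b = Counter(rater_b)
categories = set(rater_a) | set(rater_b)
p_expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible improvement
kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"Observed agreement: {p_observed:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```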

Validity

Validity is the extent to which the scores from a measure represent the variable they are intended to. But how do researchers make this judgment? We have already considered one factor that they take into account—reliability. When a measure has good test-retest reliability and internal consistency, researchers should be more confident that the scores represent what they are supposed to. There has to be more to it, however, because a measure can be extremely reliable but have no validity whatsoever. As an absurd example, imagine someone who believes that people’s index finger length reflects their self-esteem and therefore tries to measure self-esteem by holding a ruler up to people’s index fingers. Although this measure would have extremely good test-retest reliability, it would have absolutely no validity. The fact that one person’s index finger is a centimeter longer than another’s would indicate nothing about which one had higher self-esteem.

Textbook presentations of validity usually divide it into several distinct “types.” But a good way to interpret these types is that they are other kinds of evidence—in addition to reliability—that should be taken into account when judging the validity of a measure. Here we consider four basic kinds: face validity, content validity, criterion validity, and discriminant validity.

Face Validity

Face validity is the extent to which a measurement method appears “on its face” to measure the construct of interest. Most people would expect a self-esteem questionnaire to include items about whether they see themselves as a person of worth and whether they think they have good qualities. So a questionnaire that included these kinds of items would have good face validity. The finger-length method of measuring self-esteem, on the other hand, seems to have nothing to do with self-esteem and therefore has poor face validity. Although face validity can be assessed quantitatively—for example, by having a large sample of people rate a measure in terms of whether it appears to measure what it is intended to—it is usually assessed informally.

Face validity is at best a very weak kind of evidence that a measurement method is measuring what it is supposed to. One reason is that it is based on people’s intuitions about human behavior, which are frequently wrong. It is also the case that many established measures in psychology work quite well despite lacking face validity. The Minnesota Multiphasic Personality Inventory (MMPI) measures many personality characteristics and disorders by having people decide whether each of over 567 different statements applies to them—where many of the statements do not have any obvious relationship to the construct that they measure. Another example is the Implicit Association Test, which measures prejudice in a way that is nonintuitive to most people (see Note 5.31 “How Prejudiced Are You?” ).

How Prejudiced Are You?

The Implicit Association Test (IAT) is used to measure people’s attitudes toward various social groups. The IAT is a behavioral measure designed to reveal negative attitudes that people might not admit to on a self-report measure. It focuses on how quickly people are able to categorize words and images representing two contrasting groups (e.g., gay and straight) along with other positive and negative stimuli (e.g., the words “wonderful” or “nasty”). The IAT has been used in dozens of published research studies, and there is strong evidence for both its reliability and its validity (Nosek, Greenwald, & Banaji, 2006). You can learn more about the IAT—and take several of them for yourself—at the following website: https://implicit.harvard.edu/implicit .

Content Validity

Content validity is the extent to which a measure “covers” the construct of interest. For example, if a researcher conceptually defines test anxiety as involving both sympathetic nervous system activation (leading to nervous feelings) and negative thoughts, then his measure of test anxiety should include items about both nervous feelings and negative thoughts. Or consider that attitudes are usually defined as involving thoughts, feelings, and actions toward something. By this conceptual definition, a person has a positive attitude toward exercise to the extent that he or she thinks positive thoughts about exercising, feels good about exercising, and actually exercises. So to have good content validity, a measure of people’s attitudes toward exercise would have to reflect all three of these aspects. Like face validity, content validity is not usually assessed quantitatively. Instead, it is assessed by carefully checking the measurement method against the conceptual definition of the construct.

Criterion Validity

Criterion validity is the extent to which people’s scores on a measure are correlated with other variables (known as criteria ) that one would expect them to be correlated with. For example, people’s scores on a new measure of test anxiety should be negatively correlated with their performance on an important school exam. If it were found that people’s scores were in fact negatively correlated with their exam performance, then this would be a piece of evidence that these scores really represent people’s test anxiety. But if it were found that people scored equally well on the exam regardless of their test anxiety scores, then this would cast doubt on the validity of the measure.

A criterion can be any variable that one has reason to think should be correlated with the construct being measured, and there will usually be many of them. For example, one would expect test anxiety scores to be negatively correlated with exam performance and course grades and positively correlated with general anxiety and with blood pressure during an exam. Or imagine that a researcher develops a new measure of physical risk taking. People’s scores on this measure should be correlated with their participation in “extreme” activities such as snowboarding and rock climbing, the number of speeding tickets they have received, and even the number of broken bones they have had over the years. Criteria can also include other measures of the same construct. For example, one would expect new measures of test anxiety or physical risk taking to be positively correlated with existing measures of the same constructs. So the use of converging operations is one way to examine criterion validity.

Assessing criterion validity requires collecting data using the measure. Researchers John Cacioppo and Richard Petty did this when they created their self-report Need for Cognition Scale to measure how much people value and engage in thinking (Cacioppo & Petty, 1982). In a series of studies, they showed that college faculty scored higher than assembly-line workers, that people’s scores were positively correlated with their scores on a standardized academic achievement test, and that their scores were negatively correlated with their scores on a measure of dogmatism (which represents a tendency toward obedience). In the years since it was created, the Need for Cognition Scale has been used in literally hundreds of studies and has been shown to be correlated with a wide variety of other variables, including the effectiveness of an advertisement, interest in politics, and juror decisions (Petty, Briñol, Loersch, & McCaslin, 2009).

Discriminant Validity

Discriminant validity is the extent to which scores on a measure are not correlated with measures of variables that are conceptually distinct. For example, self-esteem is a general attitude toward the self that is fairly stable over time. It is not the same as mood, which is how good or bad one happens to be feeling right now. So people’s scores on a new measure of self-esteem should not be very highly correlated with their moods. If the new measure of self-esteem were highly correlated with a measure of mood, it could be argued that the new measure is not really measuring self-esteem; it is measuring mood instead.

When they created the Need for Cognition Scale, Cacioppo and Petty also provided evidence of discriminant validity by showing that people’s scores were not correlated with certain other variables. For example, they found only a weak correlation between people’s need for cognition and a measure of their cognitive style—the extent to which they tend to think analytically by breaking ideas into smaller parts or holistically in terms of “the big picture.” They also found no correlation between people’s need for cognition and measures of their test anxiety and their tendency to respond in socially desirable ways. All these low correlations provide evidence that the measure is reflecting a conceptually distinct construct.
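The logic of convergent and discriminant evidence can be summarized as a pair of correlations. The Python sketch below uses invented scores and placeholder variable names; it is not Cacioppo and Petty’s data, only an illustration of the pattern one would look for.

```python
import numpy as np

# Hypothetical scores for eight participants (illustrative only)
need_for_cognition = np.array([55, 62, 48, 70, 66, 40, 58, 64])
achievement_test   = np.array([520, 580, 490, 640, 610, 450, 540, 600])  # expected to correlate
mood_today         = np.array([3, 7, 5, 4, 6, 5, 2, 7])                  # conceptually distinct

def pearson_r(x, y):
    """Pearson correlation between two score vectors."""
    return np.corrcoef(x, y)[0, 1]

convergent_r = pearson_r(need_for_cognition, achievement_test)
discriminant_r = pearson_r(need_for_cognition, mood_today)

# Convergent evidence: the correlation with a related criterion should be substantial.
# Discriminant evidence: the correlation with a conceptually distinct variable should be near zero.
print(f"Correlation with achievement test (convergent): {convergent_r:.2f}")
print(f"Correlation with mood (discriminant): {discriminant_r:.2f}")
```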

Key Takeaways

  • Psychological researchers do not simply assume that their measures work. Instead, they conduct research to show that they work. If they cannot show that they work, they stop using them.
  • There are two distinct criteria by which researchers evaluate their measures: reliability and validity. Reliability is consistency across time (test-retest reliability), across items (internal consistency), and across researchers (interrater reliability). Validity is the extent to which the scores actually represent the variable they are intended to.
  • Validity is a judgment based on various types of evidence. The relevant evidence includes the measure’s reliability, whether it covers the construct of interest, and whether the scores it produces are correlated with other variables they are expected to be correlated with and not correlated with variables that are conceptually distinct.
  • The reliability and validity of a measure is not established by any single study but by the pattern of results across multiple studies. The assessment of reliability and validity is an ongoing process.
  • Practice: Ask several friends to complete the Rosenberg Self-Esteem Scale. Then assess its internal consistency by making a scatterplot to show the split-half correlation (even- vs. odd-numbered items). Compute Pearson’s r too if you know how.
  • Discussion: Think back to the last college exam you took and think of the exam as a psychological measure. What construct do you think it was intended to measure? Comment on its face and content validity. What data could you collect to assess its reliability, criterion validity, and discriminant validity?
  • Practice: Take an Implicit Association Test and then list as many ways to assess its criterion validity as you can think of.

Cacioppo, J. T., & Petty, R. E. (1982). The need for cognition. Journal of Personality and Social Psychology, 42 , 116–131.

Nosek, B. A., Greenwald, A. G., & Banaji, M. R. (2006). The Implicit Association Test at age 7: A methodological and conceptual review. In J. A. Bargh (Ed.), Social psychology and the unconscious: The automaticity of higher mental processes (pp. 265–292). London, England: Psychology Press.

Petty, R. E., Briñol, P., Loersch, C., & McCaslin, M. J. (2009). The need for cognition. In M. R. Leary & R. H. Hoyle (Eds.), Handbook of individual differences in social behavior (pp. 318–329). New York, NY: Guilford Press.

Research Methods in Psychology Copyright © 2016 by University of Minnesota is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Measurements in quantitative research: how to select and report on research instruments

Affiliation

  • 1 Department of Acute and Tertiary Care in the School of Nursing, University of Pittsburgh in Pennsylvania.
  • PMID: 24969252
  • DOI: 10.1188/14.ONF.431-433

Measures exist to numerically represent degrees of attributes. Quantitative research is based on measurement and is conducted in a systematic, controlled manner. These measures enable researchers to perform statistical tests, analyze differences between groups, and determine the effectiveness of treatments. If something is not measurable, it cannot be tested.

Keywords: measurements; quantitative research; reliability; validity.


  • Member Benefits
  • Communities
  • Grants and Scholarships
  • Student Nurse Resources
  • Member Directory
  • Course Login
  • Professional Development
  • Organizations Hub
  • ONS Course Catalog
  • ONS Book Catalog
  • ONS Oncology Nurse Orientation Program™
  • Account Settings
  • Help Center
  • Print Membership Card
  • Print NCPD Certificate
  • Verify Cardholder or Certificate Status

ONS Logo

  • Trouble finding what you need?
  • Check our search tips.


Introduction to measurement and indicators


The goal of measurement is to get reliable data with which to answer research questions and assess theories of change. Inaccurate measurement can lead to unreliable data, from which it is difficult to draw valid conclusions. This section covers key measurement concepts, means of data collection, from whom data should be collected, and common sources of measurement error. The companion resource on survey design applies these concepts to designing survey questions and answers. See also our repository of measurement and survey design resources, which introduces readers to the measurement tools, difficulties, and solutions in a range of topics.

Testing your theory of change (ToC) requires data on needs, inputs, outputs, intermediary outcomes, and impact. 

You should collect data on covariates (particularly if you believe there will be heterogeneous treatment effects), predictors of compliance, and measures of actual treatment compliance. It can be helpful to collect information on context, cost effectiveness, and qualitative information (the “why” and the “how”).

Secondary data (e.g., administrative data, census data) can complement or even substitute for primary data collection. If you plan to use secondary data, be sure to consider how you will integrate the two data sources when you design your survey.

Measurement problems arise when:

  • Constructs (the concepts to be measured) are vague or poorly defined
  • Indicators are imperfect measures of the underlying construct
  • Individual responses are skewed, e.g., through distraction, illness, poor comprehension, etc.
  • Human/device errors lead to erroneous values

There are many types of measurement error, here grouped into three broad categories:

  • Question issues: These arise from poorly formed questions. Concerns include vagueness, negatives, double-barreled questions, presumptions, overlapping categories, and framing effects (within questions and within the questionnaire as a whole).

  • Response-related issues: These arise from the use of incomplete or overlapping categories in responses, i.e., errors in the construction of answer choices.

  • Respondent issues: These arise from the respondent’s own biases when answering questions. They include recall bias, anchoring, partiality, social desirability bias, telescoping, reporting bias, and errors of differential response.

Bias that is correlated with the treatment is more serious than bias that is not.

The purpose and practice of measurement

Some first order decisions:

  • What do you need to measure?
  • What type of data is needed to measure this?
  • How should it be collected?
  • Who should you collect it from?

You will need to gather information to test your assumptions along every step of your theory of change (ToC) from inputs to the final results. Make sure you are clear about the theoretical roadmap for your project in your particular context. Data is needed at any point where the logical chain of your ToC might break down.

[Figure: theory of change diagram]

It is often useful to also collect data on the following:

Covariates: If you believe the treatment effect may vary according to certain characteristics of participants or the study location, such as gender, urban vs. rural location, distance to nearest hospital, etc., be sure to collect data on these characteristics. It is useful to decide in advance whether to stratify on certain characteristics, which will help ensure you are powered to detect heterogeneous treatment effects. Note that this will affect other planning decisions: if stratifying, be sure to account for this in power calculations and budgeting. If writing a pre-analysis plan, the covariates for stratification should be included. See Angelucci, Karlan, and Zinman (2013) for an example of a paper that measures heterogeneous treatment effects. A code sketch of stratified random assignment appears after this list of items.

Predictors of compliance and measures of actual treatment compliance (at the individual and group level): For example, for a medical intervention involving taking tablets, we might want to gather information at baseline on whether the respondent currently takes any tablets daily—and how often they remember to do so—and then at endline on how often they took their tablets.

Context (for external validity/generalizability): Examples include exposure to other, similar programs, proximity of schools/hospitals if relevant, etc. This is covered at length in the lecture on Measurement Outcomes, Impacts, and Indicators from the Evaluating Social Programs course.

Costs: Calculating cost-effectiveness requires data on the costs of the intervention itself, as well as data on the price of any substitute products, any savings realized or costs incurred by the respondents due to the intervention, and so on. For more information, see J-PAL’s cost-effectiveness analysis guide.

Qualitative information (the “why” and the “how”): For example, if a respondent answers at endline that their child has attended school more frequently than in the past, a follow-up question could ask about the main reason for this.
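As referenced in the covariates item above, the following minimal sketch shows one way to stratify random assignment on two covariates so that heterogeneous treatment effects can later be examined within strata. The sample, covariate names, and cell sizes are invented for illustration; in a real study the strata, sample size, and assignment ratio would come from the power calculations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical baseline sample with two stratification covariates.
sample = pd.DataFrame({
    "id": range(16),
    "gender": ["f", "m"] * 8,
    "urban": [1, 1, 0, 0] * 4,
})

# Randomize within each gender x urban cell so treatment is balanced across strata.
sample["treatment"] = 0
for _, idx in sample.groupby(["gender", "urban"]).groups.items():
    labels = rng.permutation(np.array(idx))      # shuffle the row labels in this stratum
    treated = labels[: len(labels) // 2]         # assign the first half to treatment
    sample.loc[treated, "treatment"] = 1

# Check balance: each cell should be split evenly between treatment and control.
print(sample.groupby(["gender", "urban"])["treatment"].mean())
```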

Most surveys are far too long, which can lead to respondent fatigue and therefore poor data quality. You should measure exactly what you need to answer your research questions (or potential future research questions) but not go beyond that. Do not include questions if you do not have a clear idea of how they would be used or if they do not relate to a specific research question. See the survey design resource for more information.

What type of data is needed?

If using primary data , or data collected yourself, there are a number of further initial decisions to be made:

From a person versus automatically generated?

Data collected with people can include surveys, exams, games, vignettes, direct observation, diaries/logs, focus groups, or interviews

Automatically generated data is collected from a machine or process, e.g., temperature, windspeed

From existing surveys or from new survey instruments?

Basing new surveys on existing surveys that have successfully captured outcomes before saves time and resources (the survey has already been developed, tested, and used), and may mean your instrument is less likely to contain problems. Sources of existing surveys can be found in the resource on survey design.

New surveys can be tailored to your precise specifications but require time to develop and test. See more in the resource on survey design.

Cross-sectional or panel survey?

Most RCTs use panel data, since the aim is to follow the outcomes of individuals who have/have not received a specific treatment, but there may be an argument for using cross-sectional data if you believe there is a strong probability that participants will change behavior by virtue of participating in the study. See the implementation monitoring resource for more information. 

Survey data can be collected using a variety of methods, including in-person interviews, phone or web-based interviews, and self-administered modules. These methods are discussed further in the resources on survey design and conducting remote surveys. While these resources focus on collecting data for quantitative analysis, other related data collection activities include:

Census: Usually, you will be surveying a sample of the larger target population (sampling frame). To obtain a representative sample, you will need to sample from the larger group by either using available census data or collecting your own. This is described further in the resource on randomization. A sketch of drawing a simple random sample from a census frame appears after this list.

Qualitative work can be used to inform intervention design, understand how implementation is going, identify the context around noncompliance or attrition, and more. Qualitative data techniques include semi-structured interviews, assessments, games, observations, and focus group discussions. An application of qualitative research methods to gender-focused research is discussed in J-PAL’s Practical Guide to Measuring Girls’ and Women’s Empowerment in Impact Evaluations .
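As noted in the census item above, drawing the sample itself is straightforward once a frame exists. The sketch below draws a simple random sample from a hypothetical census frame with pandas; the frame, the village variable, and the sample size are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical census of the target population (the sampling frame).
frame = pd.DataFrame({
    "household_id": range(10_000),
    "village": rng.integers(1, 51, size=10_000),   # 50 villages
})

# Draw a simple random sample of 500 households without replacement.
survey_sample = frame.sample(n=500, random_state=12345)

# Sanity check: the sample's village shares should roughly mirror the frame's.
print(frame["village"].value_counts(normalize=True).head())
print(survey_sample["village"].value_counts(normalize=True).head())
```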

Secondary data can be administrative data (i.e., records kept for operational purposes by governments and other organizations, such as births, deaths, tax returns, or exam results) or non-administrative data (i.e., data gathered for research or other non-operational purposes). Key considerations when using secondary data include:

  • Does the data exist, and is it accessible?
  • If the data is not accessible, is there an established process for negotiating or applying for access?
  • What is the date range, and has the data been collected consistently across that date range?
  • Does the dataset cover the population of interest? Is there a risk of uneven or biased coverage?
  • Does the dataset cover the outcomes of interest?
  • Is the dataset reliable and unlikely to have been manipulated?

For more information on administrative data—including why and how to use it, and possible sources of bias—see the  Using Administrative Data for Randomized Evaluations  and Evaluating Technology-based Interventions  resources. J-PAL's IDEA Handbook provides in-depth technical information on accessing, protecting, and working with administrative data. Some secondary data sources are listed at the end of this guide. 

Who do you collect the data from?

The target respondent:

  • Should be the person best informed about your outcomes of interest.
  • May vary across modules. For example, in specialized modules you may want to target the household head, the person responsible for cooking, the primary income earner, or women of reproductive age.

Measurement concepts

Gathering good data requires thinking carefully about what exactly you are trying to measure and the best way to get at this information. Sometimes, the questions needed to gather even seemingly simple data (e.g., household size) can be quite complex. 

Definitions

Construct: A concept that can be measured. It may be abstract and can have multiple definitions. Examples include comprehension, crime, or income.

Indicator: A way to measure and monitor a given milestone, outcome, or construct, which helps determine whether our assumptions are correct. Examples include math test scores, reported burglaries, or daily wages.

The relation between constructs, indicators, and data is illustrated below:

[Figure: the relation between constructs, indicators, and data]

Measurement issues can come into play at all of these levels:

The construct may have multiple facets or valid definitions, so an indicator may capture the wrong facet of the underlying concept of interest (e.g., there are many different facets of intelligence, such as emotional, logical, or linguistic intelligence, and we may run into issues unless we are clear about which of these matters most for the research question).

The indicators used to measure the construct may be imperfect.

Many factors determine an individual’s responses to a survey or test (e.g., level of distraction, hunger, illness).

Human/device error may lead to an erroneous value being recorded.

The goal of measurement is to gather data that has both high validity and high reliability.

Validity: Measuring the right thing

Key question: How well does the indicator map to the outcome, i.e., is it an unbiased, accurate measure of the outcome?

As an example, suppose you are using income as a measure of individuals’ sense of financial security. There are many reasons why income may not map perfectly to feelings of financial security:

Variation over time: a highly unstable income will most likely offer less financial security than a stable income of equal average value.

Expenses: financial security is arguably most closely related to savings, whereas income will only contribute to savings if it exceeds expenses. 

Reliability: Measuring the thing precisely

Key question: Is the measure precise or noisy?

As an example, consider how measuring income per day through a recall question of earnings in the past day versus the past week will affect responses:

The former will be more variable than the latter because of day-to-day variation in income and any day-level shocks that happen to occur the day before the survey.

However, measuring income over the past week introduces greater potential for recall bias, potentially leading to systematic measurement error (e.g. if respondents tend to omit certain categories of income that are hard to recall). See below for more on systematic measurement error.
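To make this trade-off concrete, the sketch below simulates a hypothetical respondent pool whose income varies from day to day and compares a one-day recall measure with a seven-day recall that is less noisy but systematically under-reported. The income process, the noise, and the 10% under-reporting rate are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_respondents = 1_000
true_daily_income = 10.0   # hypothetical average daily income

# Each respondent's income fluctuates day to day around the same true mean.
daily_incomes = rng.gamma(shape=4, scale=true_daily_income / 4, size=(n_respondents, 7))

# Measure 1: "How much did you earn yesterday?" -- picks up a single noisy day.
one_day_recall = daily_incomes[:, -1]

# Measure 2: "How much did you earn in the past week?" divided by 7 -- averages out
# daily noise, but suppose respondents forget about 10% of hard-to-recall income.
week_recall = daily_incomes.sum(axis=1) * 0.9 / 7

print(f"True mean daily income: {daily_incomes.mean():.2f}")
print(f"One-day recall: mean {one_day_recall.mean():.2f}, std {one_day_recall.std():.2f} (noisy but unbiased)")
print(f"Weekly recall:  mean {week_recall.mean():.2f}, std {week_recall.std():.2f} (less noisy but biased)")
```

The one-day measure is unbiased but unreliable (high variance); the weekly measure is more reliable but carries a systematic error, which is the pattern described above.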


Proxy indicators

Proxy indicators may be used when constructs, or the main concepts being investigated (such as crime or income), are hard to measure. Proxy indicators must be:

  • Strongly correlated with the construct or indicator it stands in for (the higher the correlation, the better the proxy)
  • Able to change in tandem with the construct. For example, gender is a poor proxy for earnings, despite the correlation, as it will generally not be changed through an intervention.

More on common proxies, including the PPI (Poverty Probability Index), can be found in the resource on survey design .
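One quick diagnostic for a candidate proxy, sketched below on invented pilot data, is to check how strongly it correlates with a directly measured version of the construct. The variables here (an asset index and household size as candidate proxies for consumption) and their data-generating process are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_pilot = 200

# Hypothetical pilot data: directly measured monthly consumption (the construct)
# plus two candidate proxies.
consumption = rng.lognormal(mean=5.0, sigma=0.5, size=n_pilot)
asset_index = 0.8 * np.log(consumption) + rng.normal(0, 0.3, n_pilot)    # tracks consumption
household_size = rng.poisson(lam=5, size=n_pilot).astype(float)          # largely unrelated here

for name, proxy in [("asset index", asset_index), ("household size", household_size)]:
    r, p = pearsonr(proxy, np.log(consumption))
    print(f"{name:15s} correlation with log consumption: r = {r:.2f} (p = {p:.3f})")
```

A proxy that correlates well in a pilot still needs the second property above: it must be able to move when the construct moves.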

Minimizing measurement error

Measurement error occurs when the response provided by a respondent differs from the real or true value. Errors may be random or systematic. Random error will average out to zero, while systematic error will not. It is important to distinguish measurement error, which occurs during the data collection process, from validity error, which occurs when the indicators do not appropriately map to the concept of interest. For more information on designing surveys to minimize measurement error, see the survey design resource.

The process of answering a question

Before introducing the sources of measurement error it is helpful to consider the steps that a respondent goes through when answering a question:


For example, a respondent asked how many times they ate rice this month must:

1. Understand and interpret the question.
2. Think about when they eat rice.
3. Add up how many days that is per week and per month.
4. Give their answer (or fit it within the answer options).

Bias can creep in during each of these steps. For example:

  • Respondents may have different interpretations of “rice,” “times,” “consume,” and “this month.” For example, does “this month” refer to the past 30 days or the month in which the respondent is being interviewed (e.g., the month of June)?
  • Respondents may interpret this as just eating grains of rice and not consider rice products such as noodles, milk, pancakes, etc.
  • Respondents may make errors in calculation, compounded by differences in interpretation in the previous step.
  • Respondents may decide to give an incorrect estimate (perhaps due to bias: people may consider rice unhealthy and therefore claim to eat less).

Types of measurement error

There are several different sources of measurement error.


These types of measurement errors can be further grouped into 1) question issues, 2) response-related issues, and 3) respondent issues.

Question issues

  • Vague wording can result in respondents interpreting questions in different ways. Example: “How many times did you consume rice this month?” Respondents may have different interpretations of “rice,” “times,” “consume,” and “this month” (does “this month” refer to the past 30 days or to the month in which the respondent is being interviewed, e.g., the month of June?). Tip: Look at each word in your question carefully, brainstorm alternate meanings, and define any ambiguous concept. This is particularly important with abstract concepts, e.g., empowerment, risk aversion, or trust.
  • Negatives can be confusing and lead to misinterpretations. Example: “Many people regularly do not eat at least one meal per week. For how many weeks in the last year was this not true for your family?” Tip: Avoid using negatives wherever possible.
  • Double-barreled questions: When a question has multiple parts, it may not be clear which part of the question respondents are answering. Example: “Should the government provide free education because school is too expensive in our community?” Tip: Avoid convoluted sentence structures, and break complex questions into their constituent parts.
  • Presumptions about the respondent can threaten data quality. Example: “How would you rate the quality of the coffee this morning?” Tip: Try not to make assumptions (here, the question assumes the respondent drinks coffee); use filters and skip patterns wherever possible, and make sure there is a “not applicable” option for each question where relevant.
  • Framing effects within a question: The way individuals react to choices will depend on how they are presented. Individuals may give very different answers to the following two questions: (1) “Hitting your child to discipline them is illegal in your country. Have you ever hit your child to discipline them?” (2) “Many people think that physically disciplining their child is an effective way to teach them how to behave. Have you ever hit your child to discipline them?” Tip: Try to be as neutral as possible when framing questions.
  • Framing effects within the questionnaire: The way individuals answer a question may depend on which questions they have already answered. Example: If a respondent has just answered a series of questions about education issues in her village, she may be more likely to select education as her top policy priority. Tip: Be careful of where questions are placed, and consider randomly varying the order of questions if you are concerned about a framing effect.

Response-related issues

  • Incomplete categories: Errors of completeness happen when respondents cannot find an appropriate response category. Example: Any question about education must include “no education.” Tip: Ensure that your response categories are fully exhaustive, and include “don’t know,” “prefer not to reply,” and “other (specify)” wherever relevant. Extensive questionnaire piloting will also help identify incompleteness in response options.
  • Overlapping categories: When categories overlap, there may be multiple ways that a respondent can answer a question. Example: If categories run 0–5, 5–10, 10–15, etc., then respondents whose answer is 5 have two possible categories they could choose. Tip: Ensure that all categories are mutually exclusive.

Respondent issues

  • Recall bias: Individuals may vary in the accuracy or completeness of their recollections. One way around this is to ask respondents to record information in real time. Sub-optimal: “What did you eat for dinner on Tuesday 3 weeks ago?” Better: “I asked you to keep a food diary to record what you eat every day. Could you show me your food diary for Tuesday 3 weeks ago?”
  • Anchoring: Individuals tend to rely too heavily on the first (or sometimes most recent) piece of information they see and will be more likely to give an answer that is close to that information. Avoid adding anchors to questions wherever possible. Sub-optimal: “Most people have 3 meals per day. How many meals per day do you think is normal?” Better: “People vary in the number of meals they consume per day. How many meals per day do you think is normal?”
  • Partiality: Respondents may be biased if a question is framed to suggest a particular answer, especially if the question or answer implies approval of one response over others. Frame all questions as neutrally as possible. Sub-optimal: “Candidate X has fantastic policies around health and education. Would you consider voting for Candidate X?” Better: “Would you consider voting for Candidate X?”
  • Social desirability bias: Respondents will tend to answer questions in a manner that is favorable to others, i.e., emphasize strengths, hide flaws, or avoid stigma. They may be reluctant to admit to behavior that is not generally approved of. Try to ask questions indirectly, ensure that respondents have complete privacy (and remind them of it!), and try to make sensitive questions less specific; additional guidance on asking sensitive questions appears in the survey design resource. Sub-optimal: “Hitting your child to discipline them is illegal in your country. Have you ever hit your child to discipline them?” Better: “People have different strategies for teaching discipline to their children. Have you ever hit your child to discipline them?”
  • Telescoping: People tend to perceive recent events as being more remote (backward telescoping) and distant events as being more recent (forward telescoping), which can lead to over- or under-reporting. Sub-optimal: “What big purchases have you made in the last year?” Better: “What big purchases have you made since the 20th January last year? Please don’t include any purchases you made before that.” Note that if you asked the same question at baseline, it is even better to say something like: “When I visited you before, you said you had bought X and Y in the last year. What big purchases have you made since then?”
  • Order effects: Individuals may choose the first option in a list that sounds acceptable without listening to the rest, or they may choose the last one because it is the most recent and easiest to remember. Options to get around this include limiting the length of the question and the number of answers, and randomizing response order for lists. Sub-optimal: Always using the same question ordering, or allowing respondents to pick an answer before hearing all of the options. Better: Randomize the question ordering and insist that respondents hear the whole list before choosing an option; or instruct enumerators not to read out the answer options, and train them to select the options that best reflect the respondent’s answer.
  • Reporting bias: Respondents have an incentive to misreport if their answers may determine qualification for a program or whether they meet certain requirements. Example: If a certain level of school attendance is required to qualify for a government program, respondents may overstate how much their children attended school. Tip: Stress anonymity and privacy, or use proxy measures or direct observation (e.g., the school’s attendance records) rather than self-reported answers.
  • Errors of differential response: The intervention itself may cause the treatment (or control) group to be more likely to record certain events, more likely to respond to a question, or more likely to appear in administrative records. Example: An intervention aims to decrease the incidence of a disease through an innovative treatment, and it involves a campaign to increase the number of individuals who go to the doctor. At endline, it appears that the incidence of the disease has increased, when in reality the campaign to get people to the hospital was successful, so more cases of the disease were recorded. Tip: Be sure that the ability to measure an outcome is not correlated with the treatment assignment (e.g., any campaign to increase hospital attendance in treatment villages should also take place in control villages), identify how the intervention may affect the response process, and choose variables that are less susceptible to bias and easier to verify.

Administrative data

Administrative data may suffer from the same sources of bias as survey data. As the researcher does not have a say in the data collection and processing phase, additional work may be needed to assess data accuracy. Common types of bias in administrative data include:

Reporting bias: As with primary data collection, respondents may have incentive to over- or under-report. An individual may under-report income to qualify for a social welfare program, while an administrative organization such as a school may overreport attendance to meet requirements. While the incentives to misreport may be stronger than with survey data, the problem is mitigated by the fact that much administrative data is not self-reported. To address reporting bias:

  • Identify the context in which the data were collected. Were there incentives to misreport information?
  • Choose variables that are not susceptible to bias (e.g., hospital visit rather than value of insurance claim).

Differential coverage: In addition to the issues listed above, administrative data may exhibit differential coverage between the treatment and control groups: (i) differential ability to link individuals to administrative records, and (ii) differential probability of appearing in administrative records (e.g., victimization as measured by calls to report a crime).

Selection bias in administrative data occurs when administrative records only exist for individuals or organizations in contact with the administration in question. This could occur with program recipients, applicants, partner schools and hospitals, and so on. 

Ask: what is the reason for the organization to collect this data?

To address differential coverage and selection bias:

  • Identify the data universe: Which individuals are included in the data, which are excluded, and why?
  • Identify how the intervention may affect the reporting of outcomes: Determine the direction in which differential selection might occur and how this might bias effect estimates.
  • Collect a baseline survey with identifiers for linking: This will ensure that you are equally likely to link treatment and control individuals to their records and can identify differential coverage.
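The last point can be checked mechanically once baseline identifiers exist. The sketch below uses invented pandas data frames to merge a baseline survey onto administrative records and compare match rates by treatment arm; all identifiers, column names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical baseline survey: study IDs, treatment assignment, and a linking identifier.
baseline = pd.DataFrame({
    "study_id": [1, 2, 3, 4, 5, 6],
    "treatment": [1, 1, 1, 0, 0, 0],
    "national_id": ["A11", "A12", "A13", "B21", "B22", "B23"],
})

# Hypothetical administrative records keyed on the same identifier (note: B23 is missing).
admin = pd.DataFrame({
    "national_id": ["A11", "A12", "A13", "B21", "B22"],
    "hospital_visits": [2, 0, 1, 3, 1],
})

# Left-merge so that unmatched survey respondents are retained with missing outcomes.
linked = baseline.merge(admin, on="national_id", how="left")
linked["matched"] = linked["hospital_visits"].notna()

# Compare match rates by arm: a large gap signals differential coverage.
print(linked.groupby("treatment")["matched"].mean())
```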

How serious is measurement error?

The severity of measurement error depends on the type and extent of error, as well as whether the bias is correlated with the treatment. 

Bias that is uncorrelated with the treatment affects both the treatment and control equally, and so will not bias the estimate of the difference between the two groups at endline. 

Bias that is correlated with the treatment is more serious: it affects the treatment and control groups differently, meaning that the estimate of the difference between the groups at endline is biased on average. This might lead to an erroneous conclusion about the sign and magnitude of the treatment effect.
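A small simulation makes this distinction concrete. The sketch below, with invented numbers, adds measurement error that is independent of treatment and error that affects only the treatment group to a simulated outcome, then compares the resulting difference-in-means estimates with the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000
treatment = rng.integers(0, 2, size=n)                            # random assignment
true_effect = 1.0
outcome = 5.0 + true_effect * treatment + rng.normal(0, 1, n)     # true outcome

def diff_in_means(y, t):
    return y[t == 1].mean() - y[t == 0].mean()

# (a) Measurement error uncorrelated with treatment: adds noise to everyone.
y_uncorrelated = outcome + rng.normal(0, 2, n)

# (b) Measurement error correlated with treatment: the treatment group over-reports by 0.5.
y_correlated = outcome + 0.5 * treatment + rng.normal(0, 2, n)

print(f"True effect:                        {true_effect:.2f}")
print(f"Estimate, error uncorrelated w/ T:  {diff_in_means(y_uncorrelated, treatment):.2f}")
print(f"Estimate, error correlated w/ T:    {diff_in_means(y_correlated, treatment):.2f}")
```

With error uncorrelated with treatment, the estimate stays centered on the true effect and only becomes noisier; with correlated error, the estimate is shifted by the differential error.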

Secondary data resources

Administrative data resources:

  • National-level administrative data can be found on the website of the ILO (labour data), World Bank (e.g., World Development Indicators including population data), UN (e.g., SDGs and trade) and national statistics authorities
  • NASA/NOAA weather data
  • J-PAL North America catalog of administrative datasets (US focus)
  • Credit reporting agency data is available from Equifax , Experian , and TransUnion (US focus)
  • The Research Data Assistance Center ( ResDAC ) provides information and assistance with applying for access to data from the Centers for Medicare and Medicaid Services
  • Researchers have compiled an inventory of data sets used to study education
  • The American Economic Association hosts resources enumerating sources and procedures for accessing US federal administrative data . 
  • Google’s Dataset Search tool, which “enables users to find datasets stored across thousands of repositories on the Web, making these datasets universally accessible and useful.”

Non-administrative data:

  • J-PAL/IPA Datahub for Field Experiments in Economics and Public Policy
  • World Bank microdata catalogue
  • IFPRI microdata catalogue
  • The Guardian has compiled a list of existing datasets  that may be of interest to international development researchers and practitioners.

Last updated February 2022.


We thank Liz Cao, Ben Morse, and Katharina Kaeppel for helpful comments. All errors are our own.

The Questionnaire Design section of the World Bank’s DIME Wiki, including:

  • Survey Instruments Design & Pilot
  • Preparing for Data Collection
  • Survey Guidelines
  • Guidelines on survey design and pilot

Grosh and Glewwe’s Designing Household Survey Questionnaires for Developing Countries: Lessons from 15 Years of the Living Standards Measurement Study

McKenzie’s Three New Papers Measuring Stuff that is Difficult to Measure and Using BDM and TIOLI to measure the demand for business training in Jamaica via The World Bank’s Development Impact Blog

J-PAL’s Practical Guide to Measuring Girls’ and Women’s Empowerment in Impact Evaluations

Bradburn, N. M., Sudman, S., & Wansink, B. (2004). Asking questions: The definitive guide to questionnaire design : for market research, political polls, and social and health questionnaires (Rev). San Francisco: Jossey-Bass.

Deaton, A., & Zaidi, S. (2002). Guidelines for constructing consumption aggregates for welfare analysis (No. 135). World Bank Publications.

Deaton, Angus S., Measuring Poverty (July 2004). Princeton Research Program in Development Studies Working Paper. Available at SSRN: https://ssrn.com/abstract=564001 or http://dx.doi.org/10.2139/ssrn.564001

Fowler, F. J. (1995). Improving survey questions: Design and evaluation. Thousand Oaks, CA: Sage.

Marsden, P. V., & Wright, J. D. (2010). Handbook of survey research (2nd ed.). Bingley, UK: Emerald.

Saris, W. E., & Gallhofer, I. N. (2007). Design, evaluation, and analysis of questionnaires for survey research. Wiley series in survey methodology. Hoboken, N.J: Wiley-Interscience.

Tourangeau, R., Rips, L. J., & Rasinski, K. A. (2000). The psychology of survey response. Cambridge, U.K, New York: Cambridge University Press.

Abay, Kibrom A., Leah EM Bevis, and Christopher B. Barrett. "Measurement Error Mechanisms Matter: Agricultural intensification with farmer misperceptions and misreporting." American Journal of Agricultural Economics (2019).

Bursztyn, L., M. Callen, B. Ferman, A. Hasanain, & A. Yuchtman. 2014. "A revealed preference approach to the elicitation of political attitudes: experimental evidence on anti-Americanism in Pakistan."  NBER Working Paper No. 20153.

Feeney, Laura (with assistance from Clare Sachsse). “Measurement.” Lecture, delivered at the J-PAL North America 2019 Research Staff Training (J-PAL internal resource).

Glennerster, Rachel, and Kudzai Takavarasha. 2013. Running Randomized Evaluations: A Practical Guide. Princeton, NJ: Princeton University Press.

Karlan, Dean. “3.2 Measuring Sensitive Topics.” (J-PAL internal resource).

Sachsse, Clare. “Theory of Change and Outcomes Measurement [California Franchise Tax Board / CA FTB].” Delivered at J-PAL’s May 2019 CA FTB training (J-PAL internal resource).

Sadamand, Nomitha. “Measuring Better: What to Measure, and How?” Lecture, delivered at J-PAL South Asia’s 2019 Measurement and Survey Design Course (J-PAL internal resource).

Sautmann, Anja. “Measurement.” Lecture, delivered at J-PAL North America’s 2018 Evaluating Social Programs Exec Ed Training.

Sudman, S., & N. Bradburn. 1982. Asking Questions: A Practical Guide to Questionnaire Design. A Wiley Imprint.

Levels of Measurement


The level of measurement refers to the relationship among the values that are assigned to the attributes for a variable. What does that mean? Begin with the idea of the variable, in this example “party affiliation.”

That variable has a number of attributes. Let’s assume that in this particular election context the only relevant attributes are “republican”, “democrat”, and “independent”. For purposes of analyzing the results of this variable, we arbitrarily assign the values 1, 2, and 3 to the three attributes. The level of measurement describes the relationship among these three values. In this case, we simply are using the numbers as shorter placeholders for the lengthier text terms. We don’t assume that higher values mean “more” of something and lower numbers signify “less”. We don’t assume the value of 2 means that democrats are twice something that republicans are. We don’t assume that republicans are in first place or have the highest priority just because they have the value of 1. In this case, we only use the values as a shorter name for the attribute. Here, we would describe the level of measurement as “nominal”.

Why is Level of Measurement Important?

First, knowing the level of measurement helps you decide how to interpret the data from that variable. When you know that a measure is nominal (like the one just described), then you know that the numerical values are just short codes for the longer names. Second, knowing the level of measurement helps you decide what statistical analysis is appropriate on the values that were assigned. If a measure is nominal, then you know that you would never average the data values or do a t-test on the data.

There are typically four levels of measurement that are defined:

In nominal measurement the numerical values just “name” the attribute uniquely. No ordering of the cases is implied. For example, jersey numbers in basketball are measures at the nominal level. A player with number 30 is not more of anything than a player with number 15, and is certainly not twice whatever number 15 is.

In ordinal measurement the attributes can be rank-ordered. Here, distances between attributes do not have any meaning. For example, on a survey you might code Educational Attainment as 0 = less than high school; 1 = some high school; 2 = high school degree; 3 = some college; 4 = college degree; 5 = post college. In this measure, higher numbers mean more education. But is the distance from 0 to 1 the same as the distance from 3 to 4? Of course not. The interval between values is not interpretable in an ordinal measure.

In interval measurement the distance between attributes does have meaning. For example, when we measure temperature (in Fahrenheit), the distance from 30 to 40 degrees is the same as the distance from 70 to 80 degrees. The interval between values is interpretable. Because of this, it makes sense to compute an average of an interval variable, whereas it doesn’t make sense to do so for ordinal scales. But note that in interval measurement ratios don’t make any sense: 80 degrees is not twice as hot as 40 degrees (although the attribute value is twice as large).

Finally, in ratio measurement there is always an absolute zero that is meaningful. This means that you can construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social research most “count” variables are ratio, for example, the number of clients in the past six months. Why? Because you can have zero clients and because it is meaningful to say that “…we had twice as many clients in the past six months as we did in the previous six months.”

It’s important to recognize that there is a hierarchy implied in the level of measurement idea. At lower levels of measurement, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is desirable to have a higher level of measurement (e.g. interval or ratio) rather than a lower one (nominal or ordinal).
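To see how the level of measurement constrains analysis in practice, the short pandas sketch below treats party affiliation as a nominal (categorical) variable and a client count as a ratio variable; the data and the numeric party codes are invented for illustration.

```python
import pandas as pd

# Invented sample data: party affiliation is nominal, client count is ratio.
df = pd.DataFrame({
    "party": ["republican", "democrat", "independent", "democrat", "democrat", "republican"],
    "party_code": [1, 2, 3, 2, 2, 1],   # arbitrary numeric codes for the same attribute
    "clients_past_6mo": [0, 4, 2, 7, 3, 5],
})

# Nominal: frequencies and the mode are meaningful summaries.
print(df["party"].value_counts())
print("Modal party:", df["party"].mode()[0])

# Averaging the arbitrary codes is computable but meaningless --
# a result of 1.83 is not "a bit less than democrat".
print("Meaningless average of codes:", df["party_code"].mean())

# Ratio: zero is a true zero, so means and ratios are interpretable.
print("Mean clients in past six months:", df["clients_past_6mo"].mean())
```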




Measurement Science and Technology



Launched in 1923, Measurement Science and Technology was the world's first scientific instrumentation and measurement journal and the first research journal produced by the Institute of Physics. It covers all aspects of the theory, practice and application of measurement, instrumentation and sensing across science and engineering.


Hamidreza Eivazi et al 2024 Meas. Sci. Technol. 35 075303

High-resolution reconstruction of flow-field data from low-resolution and noisy measurements is of interest due to the prevalence of such problems in experimental fluid mechanics, where the measurement data are in general sparse, incomplete and noisy. Deep-learning approaches have been shown suitable for such super-resolution tasks. However, a high number of high-resolution examples is needed, which may not be available for many cases. Moreover, the obtained predictions may lack in complying with the physical principles, e.g. mass and momentum conservation. Physics-informed deep learning provides frameworks for integrating data and physical laws for learning. In this study, we apply physics-informed neural networks (PINNs) for super-resolution of flow-field data both in time and space from a limited set of noisy measurements without having any high-resolution reference data. Our objective is to obtain a continuous solution of the problem, providing a physically-consistent prediction at any point in the solution domain. We demonstrate the applicability of PINNs for the super-resolution of flow-field data in time and space through three canonical cases: Burgers' equation, two-dimensional vortex shedding behind a circular cylinder and the minimal turbulent channel flow. The robustness of the models is also investigated by adding synthetic Gaussian noise. Furthermore, we show the capabilities of PINNs to improve the resolution and reduce the noise in a real experimental dataset consisting of hot-wire-anemometry measurements. Our results show the adequate capabilities of PINNs in the context of data augmentation for experiments in fluid mechanics.

Simon Laflamme et al 2023 Meas. Sci. Technol. 34 093001

Structural health monitoring (SHM) is the automation of the condition assessment process of an engineered system. When applied to geometrically large components or structures, such as those found in civil and aerospace infrastructure and systems, a critical challenge is in designing the sensing solution that could yield actionable information. This is a difficult task to conduct cost-effectively, because of the large surfaces under consideration and the localized nature of typical defects and damages. There have been significant research efforts in empowering conventional measurement technologies for applications to SHM in order to improve performance of the condition assessment process. Yet, the field implementation of these SHM solutions is still in its infancy, attributable to various economic and technical challenges. The objective of this Roadmap publication is to discuss modern measurement technologies that were developed for SHM purposes, along with their associated challenges and opportunities, and to provide a path to research and development efforts that could yield impactful field applications. The Roadmap is organized into four sections: distributed embedded sensing systems, distributed surface sensing systems, multifunctional materials, and remote sensing. Recognizing that many measurement technologies may overlap between sections, we define distributed sensing solutions as those that involve or imply the utilization of numbers of sensors geometrically organized within (embedded) or over (surface) the monitored component or system. Multi-functional materials are sensing solutions that combine multiple capabilities, for example those also serving structural functions. Remote sensing are solutions that are contactless, for example cell phones, drones, and satellites. It also includes the notion of remotely controlled robots.

Malcolm A Lawn et al 2024 Meas. Sci. Technol. 35 105018

Precise control of advanced materials relies on accurate dimensional metrology at the sub-nanometre scale. At this scale, the accuracy of scanning probe microscopy (SPM) has been limited by the lack of traceable transfer standard artefacts with calibration structures of suitable dimensions. With the adoption in 2019 of the silicon crystal lattice spacing as a secondary realization of the metre in the International System of Units (SI), SPM users have direct access to a realization of the SI metre at the sub-nanometre level by means of the step height of self-assembled monatomic lattice steps that can form on the surface of silicon crystals. A key challenge of successfully adopting this pathway is establishing protocols to minimize measurement errors and artefacts in routine laboratory use. In this study, step height measurements of monoatomic lattice steps in an ordinal/staircase structure on a Si(111) crystal surface have been derived from images acquired with a commercially available, research-level atomic force microscope (AFM). Measurement results derived from AFM images using three different SPM image processing and analysis software packages are compared. Significant sources of measurement uncertainty are identified, principally the contribution from the dependence on scan direction. The calibration of the AFM derived from this measurement was used to traceably measure the sub-nanometre lattice steps on a silicon carbide crystal surface to demonstrate the viability of this calibration pathway.

Luigi Ribotta et al 2024 Meas. Sci. Technol. 35 105014

Adam Thompson et al 2021 Meas. Sci. Technol. 32 105013

Maximum permissible errors (MPEs) are an important measurement system specification and form the basis of periodic verification of a measurement system's performance. However, there is no standard methodology for determining MPEs, so when they are not provided, or not suitable for the measurement procedure performed, it is unclear how to generate an appropriate value with which to verify the system. Whilst a simple approach might be to take many measurements of a calibrated artefact and then use the maximum observed error as the MPE, this method requires a large number of repeat measurements for high confidence in the calculated MPE. Here, we present a statistical method of MPE determination, capable of providing MPEs with high confidence and minimum data collection. The method is presented with 1000 synthetic experiments and is shown to determine an overestimated MPE within 10% of an analytically true value in 99.2% of experiments, while underestimating the MPE with respect to the analytically true value in 0.8% of experiments (overestimating the value, on average, by 1.24%). The method is then applied to a real test case (probing form error for a commercial fringe projection system), where the efficiently determined MPE is overestimated by 0.3% with respect to an MPE determined using an arbitrarily chosen large number of measurements.

Ahmad Satya Wicaksana et al 2024 Meas. Sci. Technol. 35 095016

Siqi Gong et al 2024 Meas. Sci. Technol. 35 106128

Data-driven intelligent fault diagnosis methods generally require a large amount of labeled data and considerable time to train network models. However, obtaining sufficient labeled data in practical industrial scenarios has always been a challenge, which hinders the practical application of data-driven methods. A digital twin (DT) model of rolling bearings can generate labeled training datasets for various bearing faults, supplementing the limited measured data. This paper proposes a novel DT-assisted approach to address the issue of limited measured data for bearing fault diagnosis. First, a dynamic model of a bearing with damage is introduced to generate simulated bearing acceleration vibration signals. A DT model is constructed in Simulink, where the model parameters are updated based on the actual system behavior. Second, the structural parameters of the DT model are adaptively updated using the least squares method with the measured data. Third, a Vision Transformer (ViT)-based network, integrated with the short-time Fourier transform, is developed to achieve accurate fault diagnosis. By applying the short-time Fourier transform at the input end of the ViT network, the model effectively extracts additional information from the vibration signals. Pre-training the network with an extensive dataset from miscellaneous tasks enables the acquisition of pre-trained weights, which are subsequently transferred to the bearing fault diagnosis task. Experimental results verify that the proposed approach can achieve higher diagnostic accuracy and better stability.

Gustavo Quino et al 2021 Meas. Sci. Technol. 32 015203

Digital image correlation (DIC) is a widely used technique in experimental mechanics for full field measurement of displacements and strains. The subset matching based DIC requires surfaces containing a random pattern. Even though there are several techniques to create random speckle patterns, their applicability is still limited. For instance, traditional methods such as airbrush painting are not suitable in the following challenging scenarios: (i) when time available to produce the speckle pattern is limited and (ii) when dynamic loading conditions trigger peeling of the pattern. The development and application of some novel techniques to address these situations is presented in this paper. The developed techniques make use of commercially available materials such as temporary tattoo paper, adhesives and stamp kits. The presented techniques are shown to be quick, repeatable, consistent and stable even under impact loads and large deformations. Additionally, they offer the possibility to optimise and customise the speckle pattern. The speckling techniques presented in the paper are also versatile and can be quickly applied in a variety of materials.

A Sciacchitano 2019 Meas. Sci. Technol. 30 092001

Particle image velocimetry (PIV) has become the chief experimental technique for velocity field measurements in fluid flows. The technique yields quantitative visualizations of the instantaneous flow patterns, which are typically used to support the development of phenomenological models for complex flows or for validation of numerical simulations. However, due to the complex relationship between measurement errors and experimental parameters, the quantification of the PIV uncertainty is far from being a trivial task and has often relied upon subjective considerations. Recognizing the importance of methodologies for the objective and reliable uncertainty quantification (UQ) of experimental data, several PIV-UQ approaches have been proposed in recent years that aim at the determination of objective uncertainty bounds in PIV measurements.

This topical review on PIV uncertainty quantification aims to provide the reader with an overview of error sources in PIV measurements and to inform them of the most up-to-date approaches for PIV uncertainty quantification and propagation. The paper first introduces the general definitions and classifications of measurement errors and uncertainties, following the guidelines of the International Organization for Standards (ISO) and of renowned books on the topic. Details on the main PIV error sources are given, considering the entire measurement chain from timing and synchronization of the data acquisition system, to illumination, mechanical properties of the tracer particles, imaging of those, analysis of the particle motion, data validation and reduction. The focus is on planar PIV experiments for the measurement of two- or three-component velocity fields.

Approaches for the quantification of the uncertainty of PIV data are discussed. Those are divided into a-priori UQ approaches, which provide a general figure for the uncertainty of PIV measurements, and a-posteriori UQ approaches, which are data-based and aim at quantifying the uncertainty of specific sets of data. The findings of a-priori PIV-UQ based on theoretical modelling of the measurement chain as well as on numerical or experimental assessments are discussed. The most up-to-date approaches for a-posteriori PIV-UQ are introduced, highlighting their capabilities and limitations.

As many PIV experiments aim at determining flow properties derived from the velocity fields (e.g. vorticity, time-average velocity, Reynolds stresses, pressure), the topic of PIV uncertainty propagation is tackled considering the recent investigations based on Taylor series and Monte Carlo methods. Finally, the uncertainty quantification of 3D velocity measurements by volumetric approaches (tomographic PIV and Lagrangian particle tracking) is discussed.
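As a generic illustration of the Monte Carlo uncertainty-propagation idea mentioned above (a toy sketch with assumed values, not any of the specific PIV methods reviewed), velocity components with assumed Gaussian uncertainties are resampled and a derived quantity, the flow speed, is recomputed to estimate its uncertainty:

```python
import random
import statistics

# Toy Monte Carlo propagation: propagate assumed Gaussian uncertainties on
# two velocity components into a derived quantity (the speed magnitude).
u_mean, u_std = 2.0, 0.05   # m/s, hypothetical measured value and uncertainty
v_mean, v_std = 1.5, 0.05   # m/s

speeds = []
for _ in range(100_000):
    u = random.gauss(u_mean, u_std)
    v = random.gauss(v_mean, v_std)
    speeds.append((u ** 2 + v ** 2) ** 0.5)

print(f"speed = {statistics.mean(speeds):.3f} +/- {statistics.stdev(speeds):.3f} m/s")
```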

Martin Kögler and Bryan Heilala 2020 Meas. Sci. Technol. 32 012002

Time-gated (TG) Raman spectroscopy (RS) has been shown to be an effective technical solution for the major problem whereby sample-induced fluorescence masks the Raman signal during spectral detection. Technical methods of fluorescence rejection have come a long way since the early implementations of large and expensive laboratory equipment, such as the optical Kerr gate. Today, more affordable small sized options are available. These improvements are largely due to advances in the production of spectroscopic and electronic components, leading to the reduction of device complexity and costs. An integral part of TG Raman spectroscopy is the temporally precise synchronization (picosecond range) between the pulsed laser excitation source and the sensitive and fast detector. The detector is able to collect the Raman signal during the short laser pulses, while fluorescence emission, which has a longer delay, is rejected during the detector dead-time. TG Raman is also resistant against ambient light as well as thermal emissions, due to its short measurement duty cycle.

In recent years, the focus in the study of ultra-sensitive and fast detectors has been on gated and intensified charge coupled devices (ICCDs), or on CMOS single-photon avalanche diode (SPAD) arrays, which are also suitable for performing TG RS. SPAD arrays have the advantage of being even more sensitive, with better temporal resolution compared to gated CCDs, and without the requirement for excessive detector cooling. This review aims to provide an overview of TG Raman from early to recent developments, its applications and extensions.

Latest articles

Meixuan Su et al 2024 Meas. Sci. Technol. 35 115305

As an efficient and environment-friendly method, electrostatic separation has gradually replaced flotation in the separation of magnesite in recent years. In triboelectrostatic separation, the mineral particles are tribocharged as they are driven by the air flow, and their trajectories are then deflected by the electric field to achieve separation. The useful mineral in magnesite is MgCO3, but theoretical research on the charge characteristics of MgCO3 is limited. Particle image velocimetry (PIV), as an indirect measurement technique, can obtain the velocity field of a fluid from images. However, particles moving in air travel at high speed and have small sizes, so traditional PIV estimates their motion with low accuracy. In this paper, a high-speed camera is used to capture the motion trajectories of tribocharged MgCO3 particles in a parallel electric field. A new optical flow network, LFN-en-A, based on the LiteFlowNet-en network, is proposed to compute the particle motion trajectory by combining deep learning with traditional PIV, enabling displacement estimation for particles moving in air and, ultimately, calculation of the charge-to-mass ratio of single particles. The accuracy of the LFN-en-A network's estimates in the experiments was compared against that of LiteFlowNet-en, and the shooting frame rate was varied to determine the optimal rate for the LFN-en-A network. The charge-to-mass ratio (Q/m) of MgCO3 particles computed from the LFN-en-A estimates was then analyzed under different tribocharging conditions, providing a new method for charge-to-mass ratio measurement.

Yan Zhang et al 2024 Meas. Sci. Technol. 35 116202

Batch processes play an important role in modern chemical and manufacturing production, while the control of product quality relies largely on online quality prediction. However, the complex nonlinearity of batch processes and the dispersion of quality-related features may affect quality prediction performance. In this paper, a deep quality-related stacked isomorphic autoencoder for batch process quality prediction is proposed. Firstly, the raw input data are reconstructed layer by layer by an isomorphic autoencoder and the raw data features are obtained. Secondly, the quality-related information is enhanced by analyzing the correlation between the isomorphic feature of each layer of the network and the output target, and constructing a correlation loss function. Thirdly, a deep quality-related prediction model is constructed to predict the batch process quality variables. Finally, experimental validation was carried out on a penicillin fermentation simulation platform and a strip hot rolling process, and the results demonstrated the feasibility and effectiveness of the proposed model for batch process quality prediction.

Huachuan Zhao et al 2024 Meas. Sci. Technol. 35 116306

During ship operations at sea, the vessel's attitude undergoes continuous changes due to various factors such as wind, waves, and its own motion. These influences are challenging to mathematically describe, and the changes in attitude are also influenced by multiple interconnected factors. Consequently, accurately predicting the ship's attitude presents significant challenges. Previous studies have demonstrated that phenomena like wind speed and wave patterns exhibit chaotic characteristics when affecting attitude changes. However, research on predicting ship attitudes lacks an exploration of whether chaotic characteristics exist and how they can be described and applied. This paper initially identifies the chaotic characteristics of ship attitude data through phase space reconstruction analysis and provides mathematical representations for them. Based on these identified chaotic characteristics, a Transformer model incorporating feature embedding layers is employed for time series prediction. Finally, a comparison with traditional methods validates the superiority of our proposed approach.

Libing Du et al 2024 Meas. Sci. Technol. 35 115602

Particle morphology is an important factor affecting the mechanical properties of granular materials. However, it is difficult to quantify the morphological characteristics of complex concave particles. Fortunately, a complex particle can be segmented by convex decomposition, so a new shape index, the convex decomposition coefficient (CDC), related to the number of segments, is proposed. First, the pocket concavity was introduced to simplify the morphology hierarchically. Second, a cut weight linked to concavity was defined and the convex decomposition was linearly optimised by maximising the total cut weights. Third, the CDC was defined as the minimum number of blocks whose cumulative area ratio, taken in descending order, exceeded 0.9. Finally, the proposed index was used to quantify the particle morphology of coral sand. The results demonstrate that the CDC of coral sands mainly ranges from 2 to 6, with a positively skewed distribution. Furthermore, the CDC correlates well with three shape indices: sphericity, particle size, and convexity. A larger CDC is associated with smaller sphericity, larger particle size, and smaller convexity. The index therefore has value for both research and practical applications.

Zhenfa Shao et al 2024 Meas. Sci. Technol. 35 116111

In practical scenarios, gearbox fault diagnosis faces the challenge of extremely scarce labeled data. Additionally, variations in operating conditions and differences in sensor installations exacerbate data distribution shifts, significantly increasing the difficulty of fault diagnosis. To address the above issues, this paper proposes a wavelet dynamic joint self-adaptive network guided by a pseudo-label alignment mechanism (MDJSN-DFL). First, the wavelet-efficient convolution module is designed based on wavelet convolution and efficient attention mechanisms. This module is used to construct a multi-wavelet convolution feature extractor to extract critical fault features at multiple levels. Secondly, to improve the classifier's discriminability in the target domain, a transitional clustering-guided DFL is developed. This mechanism can capture fuzzy classification samples and improve the pseudo-label quality of the target domain. Finally, a dynamic joint mean square difference algorithm (DJSD) is proposed, which is composed of joint maximum mean square discrepancy and joint maximum mean discrepancy. The algorithm can adaptively adjust according to the dynamic balance factor to minimize the domain distribution discrepancy. Experiments on two different gearbox datasets show that MDJSN-DFL performs better in diagnostic scenarios under varying load conditions and different sensor installation setups, validating the proposed method's effectiveness and superiority.

Review articles

Jiashuai Huang et al 2024 Meas. Sci. Technol. 35 112001

With the continuous development of the aerospace, defense, and military industry, along with other high-end fields, the complexity of machined parts has gradually increased. Consequently, the demand for tool intelligence has also strengthened. However, traditional tools are prone to wear during cutting due to high cutting forces, high temperatures, and vibrations. Intelligent tools, in contrast to traditional ones, integrate sensors into their design, allowing for real-time monitoring of the cutting status and timely prediction of tool wear. The application of intelligent tools in machining significantly enhances machining quality, increases productivity, and reduces production costs. In this review, first, the tool wear monitoring methods were classified and discussed. Second, the intelligence and innovation of sensors in monitoring cutting force, temperature, and vibration were introduced, and the commonly used types of sensors for online monitoring of cutting force were detailed. Furthermore, different types of sensors in tool wear were discussed, and the advantages of multi-sensor monitoring were summarized. Some urgent issues and perspectives that need to be addressed were proposed, providing new ideas for the design and development of intelligent tools.

Dang Tuyet Minh and Nguyen Ba Dung 2024 Meas. Sci. Technol. 35 112002

Path planning for an unmanned aerial vehicle (UAV) is the process of determining a path that travels through each location of interest within a particular area. Numerous algorithms have been proposed in the literature to address UAV path planning problems. However, in order to handle complex and dynamic environments with different obstacles, it is critical to utilize appropriate fusion algorithms when planning the UAV path. This paper reviews hybrid algorithms developed in the last ten years for finding optimal UAV routes, along with their advantages and disadvantages. The UAV path planning methods are classified into categories of hybrid algorithms based on traditional, heuristic, and machine learning approaches. Criteria used to evaluate the algorithms include execution time, total cost, energy consumption, robustness, data, computation, obstacle avoidance, and environment. The results of this study provide a reference resource for researchers seeking path planning methods for UAVs.

Qi Wang et al 2024 Meas. Sci. Technol. 35 102001

With the booming development of modern industrial technology, rotating machinery fault diagnosis is of great significance for improving the safety, efficiency and sustainability of industrial production. Machine learning, as an effective solution for fault identification, has advantages over traditional fault diagnosis solutions in processing complex data, achieving automation and intelligence, adapting to different fault types, and continuously optimizing. It has high application value and broad development prospects in the field of rotating machinery fault diagnosis. Therefore, this article reviews machine learning and its applications in intelligent fault diagnosis technology and covers advanced topics in emerging deep learning techniques and optimization methods. Firstly, the article briefly introduces the theories of several main machine learning methods, including Extreme Learning Machines (ELM), Support Vector Machines (SVM), Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs) and related emerging deep learning technologies of recent years, such as Transformers, generative adversarial networks (GANs) and graph neural networks (GNNs). The optimization techniques for diagnosing faults in rotating machinery are subsequently investigated. Then, a brief introduction is given to papers on the application of these machine learning methods in the field of rotating machinery fault diagnosis, and the application characteristics of the various methods are summarized. Finally, the survey discusses the problems that remain for machine learning in rotating machinery fault diagnosis and proposes an outlook.

Liuyang Song et al 2024 Meas. Sci. Technol. 35 092003

This paper presents a comprehensive review of the state-of-the-art techniques for predicting the remaining useful life (RUL) of rolling bearings. Four key aspects of bearing RUL prediction are considered: data acquisition, construction of health indicators, development of RUL prediction algorithms, and evaluation of prediction results. Additionally, publicly available datasets that can be used to validate bearing prediction algorithms are described. The existing RUL prediction algorithms are categorized into three types and comprehensively reviewed: physics-based, statistical, and data-driven. In particular, the progress made in data-driven prediction methods is summarized, and typical methods such as recurrent neural networks, convolutional networks, graph convolutional networks, Transformers, and transfer learning-based methods are introduced in detail. Finally, the challenges faced by data-driven methods in RUL prediction for bearings are discussed.

Mohanraj T et al 2024 Meas. Sci. Technol. 35 092002

Milling is an extremely adaptable process that can be utilized to fabricate a wide range of shapes and intricate 3D geometries. The versatility of the milling process renders it useful for the production of a diverse range of components and products in several industries, including aerospace, automotive, electronics, and medical equipment. Monitoring tool conditions is essential for maintaining product quality, minimizing production downtime, and maximizing tool life. Advances in this field have been driven by the need for increased productivity, reduced tool wear, and improved process efficiency. Tool condition monitoring (TCM) in the milling process is a critical aspect of machining operations. TCM involves assessing the health and performance of cutting tools used in milling machines. As technology evolves, staying updated with the latest developments in this field is essential for manufacturers seeking to optimize their milling operations. However, addressing the challenges associated with sensor integration, data analysis, and cost-effectiveness remains crucial. To fill this research gap, this paper provides an overview of the extensive literature on monitoring milling tool conditions. It summarizes the key focus areas, including tool wear sensors and the application of various machine learning and deep learning algorithms. It also discusses the potential applications of TCM beyond wear detection, such as predicting tool breakage, tool wear, the cutting tool's remaining lifetime, and the challenges faced by TCMs. This review also provides suggestions for potential future research endeavors and is anticipated to offer valuable insights for the development of advanced TCMs in terms of tool wear monitoring and predicting remaining useful life.

Accepted manuscripts

Gu et al 

Damage to the composite propeller blades could lead to rotational imbalance, which seriously affects the operational safety of unmanned aerial vehicles (UAVs), therefore, a novel method combining the Teager energy operator and bidirectional temporal convolutional network is proposed for detecting, localizing, and quantifying the damage-related imbalance in the blades. A flexible sensing system that contains MEMS accelerometers, signal conditioning, and wireless transmission is integrated with the composite propeller for in-situ signal acquisition of the propeller blades. Teager energy operator (TEO) is applied to demodulate and enhance the pulse compositions in vibration signals and singular value decomposition (SVD) is employed to suppress random noise, resulting in denoised Teager energy spectrums for model input. Temporal convolutional network (TCN) has been widely used in sequence signal modeling because the causal dilated convolution could learn the context information of sequence signals while maintaining the advantages of parallel computing. To fully extract the signal features, bidirectional temporal convolutional network (BiTCN) models are established to learn both the forward and backward signal features. Experimental verification results show that the proposed method detects the existence of imbalance with 100% accuracy, and the accuracies of localization and quantization are 99.65% and 98.61%, respectively, which are much higher than those of the models with the original signal as input. In addition, compared with the other four different algorithms, BiTCN is superior in terms of convergence speed and prediction accuracy.

Zhao et al 

Flow visualization in harsh environments such as a scramjet combustor, featuring highly turbulent supersonic reactive flow with intense luminescence emission, is challenging and typically lacks the spatiotemporal resolution essential for resolving flow dynamics. This study presents the development of a robust flow visualization technique with exceptional spatiotemporal resolution in a scramjet combustor. By utilizing a customized LED light source, the short pulse duration along with a high peak power and a single-color emission ensures an instantaneous exposure with little background luminescence interference. Focusing schlieren image measurements with a mitigated path-integration effect are successfully demonstrated in a scramjet engine combustor at instant, frame-straddling, and sequential temporal resolutions of 100 ns, 500 ns, and 26 µs, respectively, along with a megapixel imaging resolution. Consequently, in addition to flow visualization, the exceptional spatiotemporal correlation resolved by the present measurements shows attractive potential for velocimetry in harsh high-speed flow environments.

Zhou et al 

Laser absorption tomography (LAT) has been widely employed to capture two/three-dimensional reactive flow-field parameters with a penetrating spatiotemporal resolution. In industrial environments, LAT is generally implemented by measuring multiple, e.g. 30 to more than 100, wavelength-modulated laser transmissions at high imaging rates, e.g. tens to thousands of frames per second (fps). A short-period LAT experiment can generate an extensive load of data, which requires massive computational resources and time for post-processing. In this work, a large-scale data processing platform is designed for industrial LAT. The platform significantly speeds up LAT signal processing by introducing a parallel computing architecture. By identifying the discrepancy between the measured and theoretical spectra, the new platform enables indexing of the laser-beam measurements that are disturbed by harsh-environment noise. Such a scheme facilitates effective removal of noise-distorted beams, which can lead to artefacts in the reconstructed images. The designed platform is validated by a lab-based LAT experiment, implemented by processing the laser transmissions of a 32-beam LAT sensor working at 250 fps. To process a 60-second LAT experimental dataset, the parallelism enabled by the platform reduces the computational time by 40.12% compared to the traditional single-thread approach. The error-detection scheme enables accurate identification of noise-distorted measurements, i.e. the 0.59% of overall laser-beam measurements that fall outside the physical model.

Lu et al 

Arresters are one of the critical components of the power system. However, due to the arrester's regular and uniform umbrella skirt, both traditional manual detection methods and existing computer vision approaches exhibit limitations in accuracy and efficiency. This paper proposes an automatic, robust, and efficient arrester point cloud registration method to address this problem. First, a robotic arm maneuvers a depth camera to capture point cloud data from various perspectives. Then, the fast global registration (FGR) coarse registration method based on the signature of histograms of orientations (SHOT) descriptor is used to produce preliminary registration results. This result is ultimately used as the initial value of an improved iterative closest point (ICP) algorithm to refine the registration further. Experimental results on various datasets collected from arresters and on public datasets show that the algorithm's root mean square error (RMSE) is less than 0.1 mm, meeting the requirements of the engineering application of arrester detection.

Xu et al 

Because of the "soft-field" effect and the ill-posed, ill-conditioned inverse problem, it is difficult to obtain high quality images from an electrical capacitance tomography (ECT) system. To achieve high quality images and fast imaging speed with limited measurement data, an image reconstruction algorithm initially proposed for compressive sensing is modified for reconstructing ECT images. In the proposed algorithm, deep networks inspired by the iterative shrinkage-thresholding algorithm (ISTA) are used to achieve a mathematically interpretable model with learnable parameters. On this basis, the traditional Landweber iteration is combined with ISTA-Net to optimize ECT image reconstruction. The training and test process is driven by a dynamic simulation coupling the gas-oil two-phase flow field and the ECT electrostatic field. Test results demonstrate that this algorithm is superior to the traditional image reconstruction algorithms for ECT. Compared with the LBP algorithm, the averaged image error and gas fraction error are reduced by 20.44% and 16.74% respectively, while the computational speed is similar to that of the Landweber iteration. The reconstruction accuracy of the two-phase interface and gas fraction of the new algorithm has been validated by static experimental tests, showing that it is promising for application in real gas-oil two-phase flow measurement.


Open access

Kenneth M Peterson et al 2024 Meas. Sci. Technol. 35 115601

Surface characteristics are a major contributor to the in-service performance, particularly fatigue life, of additively manufactured (AM) components. Centrifugal disk finishing (CDF) is one of many rigid media, abrasive machining processes employed to smooth the surfaces and edges of AM components. Within the general family of abrasive machining processes currently applied to AM, CDF is moderate in terms of material removal rate and the inertial forces exerted. How CDF alters the underlying microstructure of the processed surface is currently unknown. Here, white light profilometry and high-energy x-ray diffraction are employed to characterize surface finish, crystallographic texture, and anisotropic distributions of residual microscale strain as a function of depth in CDF-finished Inconel 718 manufactured with laser powder bed fusion. Surfaces are finished using both unimodal and bimodal finishing media size distributions. The CDF processes employed are found to remove surface crystallographic textures (here a {111} fiber texture) from AM components, but generally not alter the bulk texture (here a cube texture). CDF is also found to impart significant amounts of residual microscale strain into the first 100 μm from the sample surface.

Eberhard Manske et al 2024 Meas. Sci. Technol. 35 110201

Minqiu Zhou et al 2024 Meas. Sci. Technol.

Dustin Witkowski et al 2024 Meas. Sci. Technol.

Lifetime-based phosphor thermometry has been applied in a wide variety of surface temperature measurement applications due to its relative ease of implementation and robustness to background interference when compared to other optical temperature measurement methods. It is often assumed that the technique is minimally intrusive if the thickness of the applied phosphor coating is < 20 µm. To evaluate this assumption, high-speed phosphor surface temperature and thermocouple measurements were performed on 4140 steel substrates installed in the cylinder head of an optically-accessible internal-combustion engine for four operating conditions. For phosphor thermometry measurements, four substrates were studied, each coated with a phosphor layer of different thickness ranging between 6 µm and 47 µm. The temperature swings measured by phosphor thermometry during combustion were shown to be heavily impacted by the presence of the phosphor coatings, increasing by roughly a factor of 2 to 2.5 when increasing the thickness from 6 µm to 30 µm. A technique was implemented which utilizes the temperature data in combination with a heat conduction model to provide estimates of the temperature swing and heat flux in the absence of the phosphor coating. It was shown that even the 6 µm phosphor coating could lead to an order of magnitude increase in the temperature swing relative to the uncoated 4140 steel substrate. Despite the intrusiveness of phosphor thermometry for the surface temperature measurements, reasonable agreement was demonstrated between heat flux estimates determined with the heat transfer modeling technique and those deduced from temperature-swing measurements using two different high-speed thermocouples. The results indicate that phosphor surface thermometry can be a reliable surface temperature and heat flux diagnostic for transient high heat flux environments, as long as proper care is taken to account for the impact of the phosphor layer on the measurement.

Ryan Thomas et al 2024 Meas. Sci. Technol. 35 116106

This paper presents machine learning classification on simulated data of permeable conducting spheres in air and seawater irradiated by low frequency electromagnetic pulses. Classification accuracy greater than 90% was achieved. The simulated data were generated using an analytical model of a magnetic dipole in air and seawater placed 1.5–3.5 m above the center of the sphere in 50 cm increments. The spheres had radii of 40 cm and 50 cm and were of permeable materials, such as steel, and non-permeable materials, such as aluminum. A series RL circuit was analytically modeled as the transmitter coil, and an RLC circuit as the receiver coil. Additive white Gaussian noise was added to the simulated data to test the robustness of the machine learning algorithms to noise. Multiple machine learning algorithms were used for classification including a perceptron and multiclass logistic regression, which are linear models, and a neural network, 1D convolutional neural network (CNN), and 2D CNN, which are nonlinear models. Feature maps are plotted for the CNNs and provide explainability of the salient parts of the time signature and spectrogram data used for classification. The pulses investigated, which expand the literature, include a two-sided decaying exponential, Heaviside step-off, triangular, Gaussian, rectangular, modulated Gaussian, raised cosine, and rectangular down-chirp. Propagation effects, including dispersion and frequency dependent attenuation, are encapsulated by the analytical model, which was verified using finite element modeling. The results in this paper show that machine learning methods are a viable alternative to inversion of electromagnetic induction (EMI) data for metallic sphere classification, with the advantage of real-time classification without the use of a physics-based model. The nonlinear machine learning algorithms used in this work were able to accurately classify metallic spheres in seawater even with significant pulse distortion caused by dispersion and frequency dependent attenuation. This paper presents the first effort towards the use of machine learning to classify metallic objects in seawater based on EMI sensing.

Atharva Hans et al 2024 Meas. Sci. Technol. 35 116002

Measuring particles' three-dimensional (3D) positions using multi-camera images in fluid dynamics is critical for resolving spatiotemporally complex flows like turbulence and mixing. However, current methods are prone to errors due to camera noise, optical configuration and experimental setup limitations, and high seeding density, which compound to create fake measurements (ghost particles) and add noise and error to velocity estimations. We introduce a Bayesian volumetric reconstruction (BVR) method, addressing these challenges by using probability theory to estimate uncertainties in particle position predictions. Our method assumes a uniform distribution of particles within the reconstruction volume and employs a model mapping particle positions to observed camera images. We utilize variational inference with a modified loss function to determine the posterior distribution over particle positions. Key features include a penalty term to reduce ghost particles, provision of uncertainty bounds, and scalability through subsampling. In tests with synthetic data and four cameras, BVR achieved 95% accuracy with less than 3% ghost particles and an RMS error under 0.3 pixels at a density of 0.1 particles per pixel. In an experimental Poiseuille flow measurement, our method closely matched the theoretical solution. Additionally, in a complex cerebral aneurysm basilar tip geometry flow experiment, our reconstructions were dense and consistent with observed flow patterns. Our BVR method effectively reconstructs particle positions in complex 3D flows, particularly in situations with high particle image densities and camera distortions. It distinguishes itself by providing quantifiable uncertainty estimates and scaling efficiently for larger image dimensions, making it applicable across a range of fluid flow scenarios.

Frank J van Kann and Alexey V Veryaskin 2024 Meas. Sci. Technol. 35 115101

A novel room temperature capacitive sensor interface circuit is proposed and successfully tested, which uses a modified all-pass filter (APF) architecture combined with a simple series resonant tank circuit with a moderate Q -factor. It is fashioned from a discrete inductor with small dissipation resonating with a grounded capacitor acting as the sensing element to obtain a resolution of Δ C ∼ 2 zF in a capacitance range of 10–30 pF. The circuit converts the change in capacitance to the change in the phase of a carrier signal in a frequency range with a central frequency set up by the tank circuit's resonant frequency and is configured to act as a close approximation of the ideal APF. This cancels out the effects of amplitude modulation when the carrier signal is imperfectly tuned to the resonance. The proposed capacitive sensor interface has been specifically developed for use as a front-end constituent in ultra-precision mechanical displacement measurement systems, such as accelerometers, seismometers, gravimeters and gravity gradiometers, where moving plate grounded air gap capacitors are frequently used. Some other applications of the proposed circuit are possible including the measurement of the electric field, where the sensing capacitor depends on the applied electric field, and cost effective capacitive gas sensors. In addition, the circuit can be easily adapted to function with very small capacitance values (1–2 pF) as is typical in MEMS-based transducers.

K F A Jorissen et al 2024 Meas. Sci. Technol. 35 115501

We present the study of millisecond-resolved polymer brush swelling dynamics using infrared spectroscopy with a home-built quantum cascade laser-based infrared spectrometer at a 1 kHz sampling rate after averaging. By cycling the humidity of the environment of the polymer brush, we are able to measure the swelling dynamics sequentially at different wavenumbers. The high sampling rate provides us with information on the reconformation of the brush at a higher temporal resolution than previously reported. Using spectroscopic ellipsometry, we study the brush swelling dynamics as a reference experiment and to correct artefacts of the infrared measurement approach. This technique informs on the changes in the brush thickness and refractive index. Our results indicate that the swelling dynamics of the polymer brush are poorly described by Fickian diffusion, pointing toward more complicated underlying transport.

Jing Guo et al 2024 Meas. Sci. Technol. 35 105119

Electrical impedance tomography (EIT) has become an integral component in the repertoire of medical imaging techniques, particularly due to its non-invasive nature and real-time imaging capabilities. Despite its potential, the application of EIT in minimally invasive surgery (MIS) has been hindered by a lack of specialized electrode probes. Existing designs often compromise between invasiveness and spatial sensitivity: probes small enough for MIS often fail to provide detailed imaging, while those offering greater sensitivity are impractically large for use through a surgical trocar. Addressing this challenge, our study presents a breakthrough in EIT probe design. The open electrode probe we have developed features a line of 16 electrodes, thoughtfully arrayed to balance the spatial demands of MIS with the need for precise imaging. Employing an advanced EIT reconstruction algorithm, our probe not only captures images that reflect the electrical characteristics of the tissues but also ensures the homogeneity of the test material is accurately represented. The versatility of our probe is demonstrated by its capacity to generate high-resolution images of subsurface anatomical structures, a feature particularly valuable during MIS where direct visual access is limited. Surgeons can rely on intraoperative EIT imaging to inform their navigation of complex anatomical landscapes, enhancing both the safety and efficacy of their procedures. Through rigorous experimental validation using ex vivo tissue phantoms, we have established the probe's proficiency. The experiments confirmed the system's high sensitivity and precision, particularly in the critical tasks of subsurface tissue detection and surgical margin delineation. These capabilities manifest the potential of our probe to revolutionize the field of surgical imaging, providing a previously unattainable level of detail and assurance in MIS procedures.

Antonella D'Alessandro et al 2024 Meas. Sci. Technol. 35 105116

Civil constructions significantly contribute to greenhouse gas emissions and entail extensive energy and resource consumption, leading to a substantial ecological footprint. Research into eco-friendly engineering solutions is therefore currently imperative, particularly to mitigate the impact of concrete technology. Among potential alternatives, shot-earth-concrete, which combines cement and earth as a binder matrix and is applied via spraying, emerges as a promising option. Furthermore, this composite material allows for the incorporation of nano and micro-fillers, thereby providing room for enhancing mechanical properties and providing multifunctional capabilities. This paper investigates the damage detection capabilities of a novel smart shot-earth concrete with carbon microfibers, by investigating the strain sensing performance of a full-scale vault with a span of 4 m, mechanically tested until failure. The material's strain and damage sensing capabilities involve its capacity to produce an electrical response (manifested as a relative change in resistance) corresponding to the applied strain in its uncracked state, as well as to exhibit a significant alteration in electrical resistance upon cracking. A detailed multiphysics numerical (i.e. mechanical and electrical) model is also developed to aid the interpretation of the experimental results. The experimental test was conducted by the application of an increasing vertical load at a quarter of the span, while modelling of the element was carried out by considering a piezoresistive material, with coupled mechanical and electrical constitutive properties, including a new law to reproduce the degradation of the electrical conductivity with tensile cracking. Another notable aspect of the simulation was the consideration of the effects of the electrical conduction through the rebars, which was found critical to accurately reproduce the full-scale electromechanical response of the vault. By correlating the outcomes from external displacement transducers with the self-monitoring features inherent in the proposed material, significant insights were gleaned. The findings indicated that the proposed smart-earth composite, besides being well suited for structural applications, also exhibits a distinctive electromechanical behavior that enables the early detection of damage initiation. The results of the paper represent an important step toward the real application of smart earth-concrete in the construction field, demonstrating the effectiveness and feasibility of full-scale strain and damage monitoring even in the presence of steel reinforcement.

What is Measurement? Scales, Types, Criteria and Developing Measurement Tools



What is Measurement?

Measurement is the process of assigning numbers or values to quantities or attributes of objects or events according to certain rules or standards. It is a fundamental aspect of science, mathematics, and everyday life, allowing us to quantify and compare various aspects of the physical world.

Table of Contents

  • 1 What is Measurement?
  • 2 Measurement Scales
  • 2.1 Nominal Scale
  • 2.2 Ordinal Scale
  • 2.3 Interval Scale
  • 2.4 Ratio Scale
  • 3 Developing Measurement Tools
  • 3.1 Concept Development
  • 3.2 Specification of Concept Dimension
  • 3.3 Selection of Indicators
  • 3.4 Formation of Index
  • 4 Criteria of Good Measurement Tool
  • 4.1 Reliability
  • 4.2 Validity
In simple terms, measurement means applying a yardstick to determine the characteristics of a physical object. Qualitative concepts, such as songs and paintings, and abstract phenomena can also be measured. However, measuring qualitative concepts is comparatively difficult because numbers cannot be easily assigned to them. For example, it is easy to state that an object or a subject weighs 10 kg.

It is much harder to rate the composition of a song and say that it is, for instance, 10 per cent good. Today, standardised tools exist to measure abstract concepts such as intelligence, unity, honesty, bravery, success and stress. High accuracy and confidence can be expected when measuring the quantitative characteristics of an object.

Measurement Scales

A measurement scale is a classification that defines the nature of the information carried by the numerals assigned to variables. Measurement scales are classified into four types: nominal, ordinal, interval and ratio.

Let us now discuss the types of measurement scales in detail.

Nominal Scale

In this scale, variables are named or labelled in no specific order. Numbers are assigned to things, beings or events only to classify, identify or label them. The categories of a nominal scale are mutually exclusive, and the numbers bear no numerical significance. Examples include assigning numbers (1, 2, 3, …) to the cricket players in a team, the books in a library, or the computers in an Internet café. These numbers cannot be used to perform mathematical operations.

If the 11 players of a cricket team are assigned the numbers 1 to 11, the average of those numbers has no meaning; the numbers serve only to identify the players. The nominal scale represents the lowest level of measurement, but it is helpful when data need to be classified. For example, for the question "What are your political views?", we can use the following nominal scale:

S. No.  Objective
1       Left Orientation
2       Right Orientation
3       Centre
4       Conservative
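To make the point concrete, here is a minimal Python sketch (hypothetical coded responses to the political-views question above): with nominal data the only meaningful summaries are frequency counts and the mode, while averaging the codes would be meaningless.

```python
from collections import Counter

# Nominal data: the codes 1-4 are labels only, so we can count them but not average them.
responses = [1, 3, 2, 1, 4, 1, 2, 3, 1, 2]   # hypothetical coded survey answers
labels = {1: "Left Orientation", 2: "Right Orientation", 3: "Centre", 4: "Conservative"}

counts = Counter(labels[r] for r in responses)
print(counts)                           # frequency of each category
print("Mode:", counts.most_common(1))   # most frequent category
# sum(responses) / len(responses) would run, but the result has no meaning.
```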

Ordinal Scale

This scale indicates only greater than or less than; it does not say how much greater or less. Only inequalities can be established on an ordinal scale, and other arithmetic operations cannot be performed, so the scale supports comparisons only. Ordinal data are presented in order of magnitude.

For example, Mr. A's colour preferences, in order, are:

1  Silver
2  White
3  Black
4  Red

Thus, with the ordinal scale, a researcher can use the median or the mode to determine the central tendency of a set of ordinal data.
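A small Python sketch (hypothetical data) of what the ordinal scale permits: the median and mode are valid summaries, but a mean of the ranks would wrongly assume equal spacing between them.

```python
import statistics

# Ordinal data: order matters, but the distance between ranks is undefined.
preference_order = ["Silver", "White", "Black", "Red"]   # Mr. A's ranking, 1..4
rank = {colour: i + 1 for i, colour in enumerate(preference_order)}

# Hypothetical favourite colours reported by ten respondents:
choices = ["White", "Silver", "Red", "White", "Black",
           "White", "Silver", "Black", "White", "Red"]
choice_ranks = sorted(rank[c] for c in choices)

print("Median rank:", statistics.median(choice_ranks))  # valid on ordinal data
print("Modal choice:", statistics.mode(choices))        # valid on ordinal data
# A mean of the ranks would imply equal spacing between ranks,
# which the ordinal scale does not guarantee.
```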

Interval Scale

This is the scale in which the intervals between successive positions are equal, that is, the positions are separated by equally spaced steps. For example, a person rates his/her level of happiness on a scale from 1 to 10.

With the interval scale, the following conclusions can be made:

  • Number 1 represents the least happy and number 10 the most happy.
  • Number 6 represents a higher level of happiness than number 5.
  • The difference in happiness between 6 and 5 is the same as the difference between 8 and 9.
  • However, the ratio of two observations cannot be taken: it cannot be stated that a rating of 4 represents twice the happiness of a rating of 2.

The basic limitation of the interval scale is that it does not have an absolute zero. A familiar example is the Celsius temperature scale, whose zero point is arbitrary and does not indicate the absence of temperature. The interval scale therefore cannot measure the absence of a characteristic, such as zero happiness. The interval scale contains the features of the nominal and ordinal scales and, in addition, involves equality of intervals. More arithmetic operations, such as addition and subtraction, can be performed on interval data, and the mean can be calculated.
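A short sketch of the interval-scale limitation using temperature (illustrative values): differences are meaningful, but ratios are not, because the zero point is arbitrary.

```python
# Interval scale: equal intervals, but no true zero.
t_morning_c = 10.0   # degrees Celsius
t_noon_c = 20.0      # degrees Celsius

print("Difference:", t_noon_c - t_morning_c, "degC")  # meaningful: 10 degC warmer
print("Naive ratio:", t_noon_c / t_morning_c)         # 2.0, but NOT "twice as hot"

# Converting to kelvin (a ratio scale with a true zero) shows why the ratio misleads:
print("Ratio in kelvin:", (t_noon_c + 273.15) / (t_morning_c + 273.15))  # ~1.035
```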

Ratio Scale

This is the scale that contains an absolute or true zero, which implies the complete absence of the trait. For example, on a centimetre scale, zero implies the absence of length or height. On the ratio scale, it is possible to take the ratio of two observations; for example, it can be stated that Ram's weight is twice Shyam's. The ratio scale is the most powerful scale of measurement, as it supports almost all statistical operations, including those that cannot be performed on the other scales.
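A minimal sketch of the ratio-scale example from the text (illustrative weights): with a true zero, ratios, and indeed the full range of descriptive statistics, become meaningful.

```python
import statistics

# Ratio scale: a true zero, so ratios and all arithmetic operations make sense.
weight_ram_kg = 80.0
weight_shyam_kg = 40.0
print("Ram / Shyam:", weight_ram_kg / weight_shyam_kg)  # 2.0 -> "twice as heavy"

weights = [52.5, 61.0, 80.0, 40.0, 73.2]                # hypothetical sample
print("Mean:", statistics.mean(weights))
print("Geometric mean:", statistics.geometric_mean(weights))  # meaningful only for ratio data
```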

Developing Measurement Tools

Any tool used to measure or collect data is called a measurement tool, also known as an assessment tool. There are several types of measurement tools, such as observations, scales, questionnaires, surveys, interviews and indexes (or indices). Measurement tools are developed in four stages, which are explained as follows:

Concept Development

This is the first stage in the development of a measurement tool. At this stage, the researcher develops a good understanding of the topic of his/her research study. For example, research on the pros and cons of a multiparty political system requires a proper understanding of the concept behind this system.

A good understanding of such a system cannot be developed without referring to the theories related to the multiparty system. However, if the research concerns a concept such as stress, which has already been researched extensively, no new concept building is required.

Specification of Concept Dimension

After developing the concept in the first stage, the researcher must clearly identify the dimensions of the concept. For example, a researcher studying the image of a company may relate that image to dimensions or factors such as customer service level, customer treatment, product quality, employee treatment, social responsibility and corporate leadership.

Selection of Indicators

At this stage, indicators for the research subject are selected. Indicators help in measuring the elements of a concept, such as respondents' knowledge, opinions, choices, expectations and feelings. Indicators are typically operationalised as measurable variables.

For example, the effectiveness of a medicine (the concept) used for treating a chronic disease is related to indicators such as changes in the mortality rate and recurrence of the disease. The researcher may convert these indicators into variables that can be measured; for instance, the number of deaths caused by the disease and the number of patients in whom the disease recurred can serve as measurable variables for this concept.

Formation of Index

After determining the multiple elements of a concept and selecting suitable indicators, the researcher needs to combine the indicators into a summated scale or index, because no single indicator can measure the concept with certainty. For example, a price index is based on a weighted sum of prices, as sketched below.
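A minimal sketch of a summated, weighted index in the spirit of the example above; the indicator scores, their common 0-100 scaling and the weights are all hypothetical.

```python
# Combine several indicators into one weighted index score.
# Scores are hypothetical and assumed to be on a common 0-100 scale.
indicators = {"customer_service": 72, "product_quality": 85, "employee_treatment": 64}
weights = {"customer_service": 0.4, "product_quality": 0.4, "employee_treatment": 0.2}

assert abs(sum(weights.values()) - 1.0) < 1e-9             # weights should sum to 1
index = sum(indicators[k] * weights[k] for k in indicators)
print(f"Company image index: {index:.1f}")                 # weighted sum of indicators
```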

Criteria of Good Measurement Tool

What are the characteristics of a good measurement tool? Above all, the tool should clearly and accurately measure what the researcher intends to measure. In addition, a good measurement tool should be easy to use and should give reliable results.

The two major characteristics of a good measurement tool are explained as follows:

Reliability

Reliability refers to the degree to which a measurement tool yields consistent results upon repeated application. A reliable instrument is not necessarily a valid instrument; however, a valid instrument must be reliable.

For example, suppose a weighing scale consistently shows every object as 2 kg heavier than it actually is. The scale can be said to be reliable because it is consistent, but it is not valid.

The reliability of a measurement tool can be affected by subject-related factors, observer/interviewer factors, instrument factors and situational factors. The reliability of physical instruments can be tested through calibration. For non-physical instruments such as questionnaires, reliability is assessed through their stability and internal consistency; the test-retest method is commonly used to measure stability, as sketched below.
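A small sketch of the test-retest idea (hypothetical scores from two administrations of the same questionnaire to the same respondents): the Pearson correlation between the two sets of scores serves as the reliability coefficient, with values close to 1 indicating consistent results.

```python
# Test-retest reliability: correlate scores from two administrations of the
# same instrument given to the same respondents (hypothetical data).
test = [12, 18, 25, 30, 22, 15, 27, 20]
retest = [13, 17, 26, 29, 23, 14, 28, 19]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(f"Test-retest reliability: {pearson(test, retest):.3f}")  # near 1 = consistent
```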

Validity

Validity refers to the degree to which a measurement tool succeeds in measuring what it is intended to measure. Reliability and validity are interdependent concepts: there can be reliability without validity, but there can be no validity without reliability. The validity of a measuring instrument is assessed on the basis of three types of validity.

These are content validity, criterion-related validity and construct validity. Content validity refers to the judgement of one or more subject matter experts regarding how well the measurement tool covers the concept being measured. Criterion-related validity comes into play when researchers use a well-established measurement procedure [for example, an established survey (say, Survey A) for measuring stress level] that measures a variable of interest (say, V) as the basis for evaluating a newly developed measurement procedure [for example, a newly created survey (say, Survey B) for measuring stress level] that measures the same variable of interest V.

Construct validity indicates how well the measurement tool measures the theoretical construct it is intended to measure. In practice, validity is therefore assessed through one or more of three methods: content validation, criterion-related validation and construct validation.
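A short sketch of criterion-related validation in the Survey A / Survey B spirit described above (all scores hypothetical): the newly developed instrument is correlated with the established criterion measure of the same variable, and a high coefficient supports its validity.

```python
# Criterion-related validity: correlate the new instrument (Survey B) against an
# established criterion measure (Survey A) of the same variable. Data are hypothetical.
survey_a = [34, 41, 28, 45, 38, 30, 44, 36]   # established stress measure
survey_b = [33, 43, 27, 46, 37, 31, 42, 35]   # newly developed stress measure

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

print(f"Criterion validity coefficient: {pearson(survey_a, survey_b):.3f}")
```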


You must be logged in to post a comment.

World's Best Online Courses at One Place

We’ve spent the time in finding, so you can spend your time in learning

Digital Marketing

Personal Growth

a measurement in research

a measurement in research

Development

a measurement in research

a measurement in research

a measurement in research

  • Boston University Libraries

Tests and Measures


The term "tests and measures" refers to tools of measurement for analytical and diagnostic purposes. In the field of social work this might refer to a survey, instrument, diagnostic test, exam, questionnaire, survey, and/or measure. 

Engagement with Tests and Measures

As a student, your course work in the Social Work program may reference the tests and measures researchers used in their analyses. As you continue to engage in research and in your own work as a practitioner, you might use tests and measures to reach your own conclusions.

Tests and measures can serve as tools of evaluation. For example, if researching the social health of older adults in a population, you might administer the Geriatric Depression Scale and assign a score based on responses. Names for these tests and measures are often direct and clear, for example:

  • Social Support Satisfaction Measure
  • Depression Sensitivity Index
  • Hunger Scale

Resources for finding the tests and measures relevant to your topic of research are available below. 

  • URL: https://library.bu.edu/testsandmeasures


1st Edition

Meaning and Measurement in Comparative Housing Research


Description

The last two decades have seen a marked growth in comparative research within the field of housing studies. This reflects the increasing globalisation of housing finance and therefore the interconnectedness of housing markets, growing interest among researchers and policy makers in learning from developments in other countries, and the availability of more funding and better comparative data to support their endeavours. Concurrently, comparative housing research has become more sophisticated, as research training has improved, the number of journals publishing this research has increased, and researchers have become what one might call more ‘methodologically aware’.

However, despite these developments, there is no single-volume book that deals with the distinct challenges that arise in comparative housing research, as opposed to other fields of comparative policy analysis. These challenges relate to the spatial fixity of housing, its dual role as a consumption and investment good, and its position as the "wobbly pillar" of the welfare state, delivered through a complex mix of government and market supports.

This volume reflects on the significant methodological strides made in the comparative housing research field during this period. The book also considers the considerable challenges that remain if comparative housing research is to match the methodological and theoretical sophistication evident in other comparative social science fields, and maps a route for this journey. This book was published as a special issue of the International Journal of Housing Policy.

About the Editors

Mark Stephens is Professor of Public Policy at the Institute for Housing, Urban and Real Estate Research, Heriot-Watt University, Edinburgh. Michelle Norris is a senior lecturer in social policy at the School of Applied Social Science, University College Dublin, Ireland. The editors jointly convene the European Network for Housing Research Working Group on Comparative Housing Policy.

Critics' Reviews

"Meaning and Measurement in Comparative Housing Research reflects the ongoing interest in comparative housing research, as well as ongoing efforts to reflect on and improve methodological approaches, and to draw on wider social science disciplines to work out new ways forward. After a decade’s neglect of method in comparative housing research, it is a welcome contribution." – Urban Studies, Sean McNelis, Swinburne Institute for Social Research, Australia " When reading the articles included in this interesting edited volume, it becomes evident that the field of comparative housing research has indeed witnessed farreaching developments and innovations over the past years... this volume provides a comprehensive overview of recent studies and recommendations in the field of comparative housing research, which will undoubtedly prove useful for students and practitioners alike." – Elise de Vuijst, Delft University of Technology, Delft, The Netherlands  

  • Open access
  • Published: 06 August 2024

Development and psychometric evaluation of the Implementation Support Competencies Assessment

  • Todd M. Jensen   ORCID: orcid.org/0000-0002-6930-899X 1 ,
  • Allison J. Metz   ORCID: orcid.org/0000-0002-0369-7021 2 &
  • Bianca Albers   ORCID: orcid.org/0000-0001-9555-0547 3  

Implementation Science, volume 19, Article number: 58 (2024)


Implementation support practitioners (ISPs) are professionals who support others to implement evidence-informed practices, programs, and policies in various service delivery settings to achieve population outcomes. Measuring the use of competencies by ISPs provides a unique opportunity to assess an understudied facet of implementation science—how the knowledge, attitudes, and skills used by ISPs affect sustainable change in complicated and complex service systems. This study describes the development and validation of a measure—the Implementation Support Competencies Assessment (ISCA)—that assesses implementation support competencies, with versatile applications across service contexts.

Recently developed practice guide materials included operationalizations of core competencies for ISPs across three domains: co-creation and engagement, ongoing improvement, and sustaining change. These operationalizations, in combination with recent empirical and conceptual work, provided an initial item pool and foundation on which to advance measurement development, largely from a confirmatory perspective (as opposed to exploratory). The measure was further refined through modified cognitive interviewing with three highly experienced ISPs and pilot-testing with 39 individuals enrolled in a university-based certificate program in implementation practice. To recruit a sample for validation analyses, we leveraged a listserv of nearly 4,000 individuals who have registered for or expressed interest in various events and trainings focused on implementation practice offered by an implementation science collaborative housed within a research-intensive university in the Southeast region of the United States. Our final analytic sample included 357 participants who self-identified as ISPs.

Assessments of internal consistency reliability for each competency-specific item set yielded evidence of strong reliability. Results from confirmatory factor analyses provided evidence for the factorial and construct validity of all three domains and associated competencies in the ISCA.

Conclusions

The findings suggest that one’s possession of high levels of competence across each of the three competency domains is strongly associated with theorized outcomes that can promote successful and sustainable implementation efforts among those who receive implementation support from an ISP. The ISCA serves as a foundational tool for workforce development to formally measure and assess improvement in the skills that are required to tailor a package of implementation strategies situated in context.


Contribution to the literature

This study describes the development and validation of a measure—the Implementation Support Competencies Assessment (ISCA)—that assesses implementation support competencies. Measuring the use of competencies by implementation support practitioners (ISPs) provides an opportunity to assess an understudied facet of implementation science—how the knowledge, attitudes, and skills used by ISPs affect implementation.

Results of the validation study offer evidence of the reliability, factorial validity, and construct validity of the ISCA.

The ISCA serves as a foundational tool for workforce development to measure and improve the skills required to build implementation capacity and to tailor multi-faceted implementation strategies situated in context.

Implementation support practitioners (ISPs) are professionals who support others to implement evidence-informed practices, programs, and policies to achieve population outcomes [ 1 , 2 , 3 , 4 , 5 ]. Several influences have contributed to the increasing attention given to describing and understanding the role of ISPs in building implementation capacity. These factors include: 1) interest in building a competent workforce for supporting implementation and evidence use [ 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 ]; 2) recent publications describing the competencies needed to be effective in an implementation support role [ 1 , 2 , 3 , 4 , 5 , 9 ]; 3) growing calls from the field of implementation science to address the emerging gap between implementation research and implementation practice—referred to as the paradoxical, ironic, or secondary gap [ 10 , 11 , 12 ]; and 4) emerging evidence that the use of multi-faceted implementation strategies to support innovations in health and social care has had limited effects on population outcomes [ 13 ].

Combined, these factors point to a need to understand how other aspects of implementation processes, beyond the use of specific implementation strategies, can contribute to improved implementation and population outcomes. Implementation support competencies, which can be conceptualized as mechanisms that support professionals in providing high-quality implementation support and capacity-building, could be a promising focal point on this front; competencies represent an integration of an ISP’s values, knowledge, attitudes, and skills [ 14 ]. Measuring the use of competencies by ISPs provides a unique opportunity to assess an understudied facet of implementation science—how the values, knowledge, attitudes, and skills used by ISPs affect sustainable change in service systems.

ISPs rely on technical and relational skills to identify, tailor, and improve evidence-based implementation strategies in different service contexts to ensure high-quality, consistent implementation of evidence-informed practices. To understand how ISPs do this work, it is important to systematically gather data from ISPs on (a) the skills they use to support change and (b) how confident and competent they are in using these skills to build implementation capacity [ 15 ]. Previous research foregrounded the critical question of “what it takes” to build sustainable implementation outcomes that contribute to improved and more equitable outcomes for people and communities [ 15 ]. The identification and explication of competencies for ISPs represent progress in the field toward understanding “what it takes”; however, a gap has remained in how to measure these competencies well. This study describes the development and validation of a measure—the Implementation Support Competencies Assessment (ISCA) [ 16 ]—that assesses implementation support competencies, with versatile applications across service contexts.

The work of ISPs must account for the dynamic and highly relational nature of implementation that involves the integration of multiple stakeholder perspectives, the identification of crucial barriers to implementation that are often invisible to observers, and the assessment of available resources to address challenges and enhance facilitators [ 1 ]. Developing a workforce that can provide implementation support will require the field of implementation science to look beyond theories, models, and frameworks, and to more deeply understand how to assess and cultivate the competencies required by professionals working to promote and sustain evidence use in human service systems. Studying implementation capacity-building approaches, such as those used by ISPs, that are situated within contexts and emphasize the relational support needed to build organizational capability for service change might be a promising method for understanding how we can achieve improved implementation and population outcomes [ 13 ]. The ISCA is a tool that could support these efforts, and the specific aims of the current study are (a) to test whether the items for each ISCA competency offer a consistent and accurate representation of that competency (i.e., reliability); (b) to confirm the hypothesized factor structure of the competencies and the domains within which they are nested (i.e., factorial validity); and (c) to assess whether the measures are significantly associated with hypothesized outcomes of implementation support (i.e., construct validity).

Measurement development process

Our process of developing the ISCA was informed by DeVellis [ 17 ], whereby we engaged in a systematic and rigorous process of measurement development. To begin, we leveraged recent scholarship that offers clear and rich descriptions of the constructs intended for measurement—the 15 core competencies posited to undergird effective implementation support [ 1 , 2 , 3 , 4 , 5 ]. Recently developed practice guide materials intended to inform the work of ISPs also include operationalizations or core activities for each core competency [ 18 ]. These operationalizations, in combination with recent empirical and conceptual work noted above, provided an initial item pool (116 items across the 15 competencies) and foundation on which to advance measurement development, largely from a confirmatory perspective (as opposed to exploratory).

Next, we sought to identify an optimal format for measurement. This process was informed by other extant competency measures and our desire to balance parsimony (low respondent burden) with informativeness. Ultimately, we selected an ordinal-level response-option set whereby individuals could self-report their level of perceived competence with respect to each item. Consistent with other existing competency self-assessments [ 19 ], we selected the following response-option set: 1 = not at all competent, 2 = slightly competent, 3 = moderately competent, 4 = very competent, and 5 = extremely competent. The research team then initiated a three-stage process for item review and refinement. The first stage involved members of the research team identifying opportunities to simplify and consolidate possible items in the item pool. This led to a slight reduction in items (now 113) and item simplification.

The second stage involved use of modified cognitive interviewing with three experienced ISPs. The three participants were invited to review the assessment items in preparation for their interview, and during their interview (about 60 minutes) they were asked the following questions for each competency item set: (a) how clear are the items for this competency? (b) how accessible do the items feel for potential users? (c) what changes, if any, would you recommend for these items? Feedback from respondents led to several minor edits, shifts in terminology (e.g., use of “partner” instead of “stakeholder”), and opportunities to further clarify language used in some items (e.g., defining “champions”). All potential item revisions were reviewed and accepted by two research team members with extensive implementation research and practice experience.

The third stage involved pilot-testing the assessment with a group of professionals who were enrolled in a university-based certificate program focused on cultivating ISP core competencies. Prior to the delivery of certificate program content, participants were asked to complete the ISCA. Following the completion of each competency-specific item set, participants were given the following open-ended prompts: (a) please identify any items that felt unclear or confusing; (b) please identify any language used in these items that was difficult to understand; and (c) please provide any other thoughts or insights you would like to share about these items. The assessment was completed by 39 individuals, enabling us to tentatively assess internal consistency reliability for each competency item set (Cronbach’s alpha values ranged from .70 to .94; McDonald’s omega values ranged from .70 to .95), as well as the distributional properties of item responses (results indicated the items were not burdened significantly by skewness or kurtosis). We were also able to leverage open-ended feedback to incorporate several minor item edits, which were again reviewed and approved by the same two members of the research team.
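To make the reliability step concrete, the following sketch shows how Cronbach's alpha can be computed from an item-response matrix with numpy and pandas. It is a generic illustration rather than the authors' analysis code; the simulated responses and item names are hypothetical, and McDonald's omega, which requires estimated factor loadings, is not shown.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of items (rows = respondents, columns = items).

    alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score)
    """
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical example: 39 pilot respondents answering a 6-item competency set
# on the 1-5 scale from "not at all competent" to "extremely competent".
rng = np.random.default_rng(seed=42)
base = rng.integers(1, 6, size=(39, 1))      # a common "true" level per respondent
noise = rng.integers(-1, 2, size=(39, 6))    # small item-specific departures
responses = pd.DataFrame(np.clip(base + noise, 1, 5),
                         columns=[f"item_{i}" for i in range(1, 7)])

print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")
```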

Our next step was to prepare the assessment for validation analyses. In addition to the assessment items, we developed a set of items intended to measure two core constructs posited to be associated with the ISP core competencies [ 2 ]. One construct represented ISP gains, or the extent to which ISPs report receiving recognition, credibility, and respect from those who receive their implementation support. The second construct represented recipient benefits, or the extent to which ISPs perceive the recipients of their support experiencing increases in (a) relational capacities with the ISP, (b) implementation capability, (c) implementation opportunities, and (d) implementation motivation [ 2 ]. More details about the specific items used to measure these constructs and the ISCA are provided in the Final Measures subsection.

Data collection and sample

To recruit a sample for validation analyses, we leveraged a listserv of nearly 4,000 individuals who have registered for or expressed interest in various events and trainings focused on implementation practice offered by an implementation science collaborative housed within a research-intensive university in the Southeast region of the United States. A series of emails were sent to members of this listserv describing our efforts to validate the ISCA, with an invitation to participate. Voluntary responses (no participation incentives were offered) were collected between June and November 2023 using Qualtrics, a web-based survey platform. The survey included informed consent materials, items to collect information about respondent sociodemographic and professional characteristics, the ISCA items, and validation items. The median completion time for the survey was 22.7 minutes among the 357 participants in our final analytic sample.

Table 1 features an overview of participant characteristics. The majority of participants identified as women (84%), with 15% identifying as men, 1% identifying as gender nonconforming, and 1% preferring not to provide information about their gender identity (percentages are rounded, resulting in the possibility that the total exceeds 100%). Participants could select all racial and ethnic identities that applied to them; 76% identified as White, 11% identified as Black, 9% identified as Asian, 7% identified as Hispanic, 1% identified as Native American/American Indian/Alaska Native, 0.3% identified as Pacific Islander, 3% identified as other, and 2% preferred not to provide information about their racial/ethnic identity. Six continents of residence were represented among participants, with 78% of participants residing in North America, 7% in Europe, 6% in Australia, 4% in Asia, 4% in Africa, and 2% in South America. Thirty-eight percent indicated having more than 15 years of professional experience, 23% indicated having one to five years of experience, 22% indicated having six to ten years of experience, and the remaining 17% indicated having between 11 and 15 years of experience. The following service types were well represented among participants (more than one type could be indicated by participants): public health (32%), health (31%), mental and behavioral health (26%), child welfare (22%), and K-12 education (18%), among others. The three most common work settings were non-profit organizations (36%), higher education (27%), and state government (20%; more than one setting could be indicated by participants). See Table 1 for more details.

Final measures

Implementation Support Competencies Assessment (ISCA)

Rooted in recent scholarship and foundational steps of measurement development described earlier, the ISCA included item sets (ranging from 5 to 15 items and totaling 113 items) intended to measure each of 15 core competencies posited to undergird effective implementation support, with competencies nested within one of three overarching domains: co-creation and engagement, ongoing improvement, and sustaining change. The co-creation and engagement domain included items designed to measure the following five competencies: co-learning (6 items), brokering (6 items), address power differentials (7 items), co-design (6 items), and tailoring support (7 items). See Appendix 1 for a list of all items associated with this domain. The ongoing improvement domain included items designed to measure the following six competencies: assess needs and assets (6 items); understand context (6 items); apply and integrate implementation frameworks, strategies, and approaches (5 items); facilitation (9 items); communication (6 items); and conduct improvement cycles (6 items). See Appendix 2 for a list of all items associated with this domain. The sustaining change domain included items designed to measure the following four competencies: grow and sustain relationships (11 items), develop teams (15 items), build capacity (8 items), and cultivate leaders and champions (9 items). See Appendix 3 for a list of all items associated with this domain. Information about internal consistency reliability for each item set is featured in the Results section as a key component of the psychometric evaluation of the ISCA.
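For readers who want the instrument's structure at a glance, the sketch below encodes the domain, competency, and item-count breakdown described in this subsection as a simple Python mapping; the dictionary keys are shortened competency labels, and the structure is illustrative rather than an official artifact of the ISCA.

```python
# Illustrative summary of the ISCA structure described above:
# three domains, 15 competencies, and 113 items in total.
ISCA_STRUCTURE = {
    "co-creation and engagement": {
        "co-learning": 6,
        "brokering": 6,
        "address power differentials": 7,
        "co-design": 6,
        "tailoring support": 7,
    },
    "ongoing improvement": {
        "assess needs and assets": 6,
        "understand context": 6,
        "apply and integrate frameworks, strategies, and approaches": 5,
        "facilitation": 9,
        "communication": 6,
        "conduct improvement cycles": 6,
    },
    "sustaining change": {
        "grow and sustain relationships": 11,
        "develop teams": 15,
        "build capacity": 8,
        "cultivate leaders and champions": 9,
    },
}

n_competencies = sum(len(c) for c in ISCA_STRUCTURE.values())
n_items = sum(sum(c.values()) for c in ISCA_STRUCTURE.values())
print(n_competencies, n_items)  # expected: 15 113
```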

When completing the ISCA, participants were instructed to reflect on their experiences supporting implementation in various settings, review each item, and assess their level of competence by selecting one of the following response options: not at all competent (1), slightly competent (2), moderately competent (3), very competent (4), or extremely competent (5). If participants did not have direct experience with a particular item, they were instructed to indicate how competent they would expect themselves to be if they were to conduct the activity described in the item.

Validation constructs

Consistent with the mechanisms of implementation support articulated by Albers et al. [ 2 ], we developed and refined multi-item scales intended to measure two constructs theorized to be byproducts of ISPs possessing proficiency across the 15 core competencies of implementation support provision. Specifically, we developed three items intended to measure ISP gains, or the extent to which ISPs receive recognition, credibility, and respect from those who receive their implementation support. Participants were asked to indicate their level of agreement (ranging from 1 = Strongly Disagree to 5 = Strongly Agree) with the following three statements: “I have credibility among those who receive my implementation support,” “I am respected by those who receive my implementation support,” and “My expertise is recognized by those who receive my implementation support.”

We also developed ten items intended to measure recipient benefits , or the extent to which ISPs perceive the recipients of their support experiencing increases in (a) relational capacities with the ISP, (b) implementation capability, (c) implementation opportunities, and (d) implementation motivation [ 2 ]. Specifically, participants were asked to indicate their level of agreement (ranging from 1 = Strongly Disagree to 5 = Strongly Agree) with the following ten statements: “I am trusted by those who receive my implementation support;” “Those who receive my implementation support feel safe trying new things, making mistakes, and asking questions;” “Those who receive my implementation support increase their ability to address implementation challenges;” “Those who receive my implementation support gain competence in implementing evidence-informed interventions in their local settings;” “I provide opportunities for continued learning to those who receive my implementation support;” “I promote implementation friendly environments for those who receive my implementation support;” “Those who receive my implementation support strengthen commitment to their implementation work;” “Those who receive my implementation support feel empowered to engage in their implementation work;” “Those who receive my implementation support demonstrate accountability in their implementation work;” and “Those who receive my implementation support develop an interest in regularly reflecting on their own implementation work.” Information about internal consistency reliability for item sets related to the two validation constructs is featured in the Results section.

Data analysis

To generate evidence of the internal consistency reliability of competency-specific item sets, we estimated Cronbach’s alpha, McDonald’s omega, and Raykov’s rho coefficients for each of the 15 competencies [ 20 , 21 ]. To generate evidence of the factorial and construct validity of the ISCA, we then employed confirmatory factor analysis (CFA) in Mplus 8.6 [ 22 ]. Consistent with our hypothesized model, we estimated three separate second-order CFA models, one for each of the three competency domains: co-creation and engagement, ongoing improvement, and sustaining change. The first CFA model specified the co-creation and engagement domain as a second-order latent factor with the following five competencies specified as first-order latent factors: co-learning, brokering, address power differentials, co-design, and tailoring support. The second CFA model focused on the ongoing improvement domain as a second-order latent factor with the following six competencies specified as first-order latent factors: assess needs and assets; understand context; apply and integrate implementation frameworks, strategies, and approaches; facilitation; communication; and conduct improvement cycles. The third CFA model focused on the sustaining change domain as a second-order latent factor with the following four competencies specified as first-order latent factors: grow and sustain relationships, develop teams, build capacity, and cultivate leaders and champions. In all three models, ISP gains and recipient benefits were regressed on the second-order domain factor, and the error terms for the validation constructs were allowed to covary.
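As a rough illustration of what such a second-order specification looks like, the sketch below writes a scaled-down version of the co-creation and engagement model in lavaan-style syntax and fits it with the open-source semopy package, which is our assumption rather than the authors' tooling (they used Mplus). The indicator names are placeholders, only three indicators per competency are listed, the validation constructs are treated as observed scores rather than latent factors, and the default estimator is used instead of WLSMV with a polychoric input matrix.

```python
# A minimal second-order CFA sketch, assuming semopy's Model/fit/calc_stats
# interface. This does not reproduce the authors' Mplus WLSMV analysis.
import pandas as pd
import semopy

MODEL_DESC = """
# First-order competency factors measured by placeholder items
co_learning =~ c1_1 + c1_2 + c1_3
brokering =~ c2_1 + c2_2 + c2_3
power_differentials =~ c3_1 + c3_2 + c3_3
co_design =~ c4_1 + c4_2 + c4_3
tailoring_support =~ c5_1 + c5_2 + c5_3

# Second-order domain factor loading on the five competencies
co_creation_engagement =~ co_learning + brokering + power_differentials + co_design + tailoring_support

# Validation constructs (here treated as observed scores) regressed on the domain
isp_gains ~ co_creation_engagement
recipient_benefits ~ co_creation_engagement
"""

def fit_domain_model(data: pd.DataFrame) -> semopy.Model:
    """Fit the sketch model; `data` columns must match the indicator names above."""
    model = semopy.Model(MODEL_DESC)
    model.fit(data)
    print(semopy.calc_stats(model))  # chi-square, CFI, TLI, RMSEA, among others
    return model
```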

For purposes of model identification and calibrating the latent-factor metrics, we fixed first- and second-order factor means to a value of 0 and variances to a value of 1. To accommodate the ordinal-level nature of the ISCA items (and items used to measure the validation constructs), we employed the means- and variance-adjusted weighted least squares (WLSMV) estimator and incorporated a polychoric correlation input matrix [ 23 ]. Some missing values were present in the data, generally reflecting a steady rate of attrition as participants progressed through the ISCA. Consequently, the analytic sample for each second-order factor model varied, such that the model for the co-creation and engagement domain possessed all 357 participants, the model for the ongoing improvement domain possessed 316 participants, and the model for the sustaining change domain possessed 296 participants. Within each model, pairwise deletion was used to handle missing data, which enables the flexible use of partial responses across model variables to estimate model parameters. Missing values were shown to meet the assumption of Missing Completely at Random (MCAR) per Little’s multivariate test of MCAR (χ²[94] = 83.47, p = 0.77), a condition under which pairwise deletion performs well [ 24 , 25 ].
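Because pairwise deletion means every correlation is computed from whoever answered both items, a quick diagnostic of per-item missingness and pairwise-complete sample sizes is useful before fitting. The sketch below shows one way to do this with pandas; it uses hypothetical item columns, substitutes Spearman correlations as a rough stand-in for polychoric correlations, and does not implement Little's MCAR test, which requires a dedicated routine.

```python
import pandas as pd

def missingness_report(items: pd.DataFrame) -> None:
    """Summarize item-level missingness and pairwise-complete coverage.

    Under pairwise deletion, each correlation is computed from the respondents
    who answered both items, so the effective sample size differs by pair.
    """
    print("Missing rate per item:")
    print(items.isna().mean().round(3))

    observed = items.notna().astype(int)
    pairwise_n = observed.T @ observed   # complete cases for each item pair
    print("\nPairwise-complete sample sizes:")
    print(pairwise_n)

    # pandas drops missing values pairwise when computing correlations.
    print("\nPairwise Spearman correlations (rough stand-in for polychoric):")
    print(items.corr(method="spearman").round(2))
```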

To assess model fit, the following indices and associated values were prespecified as being indicative of good model fit: Comparative Fit Index (CFI) and Tucker-Lewis Index (TLI) values greater than 0.95, standardized root mean square residual (SRMR) values less than 0.08, and root mean square error of approximation (RMSEA) values less than or equal to 0.06 (including the upper-level 90% confidence interval) [ 26 , 27 ]. Each factor-analytic model was over-identified and sufficiently powered to detect not-close model fit [ 28 ].
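CFI, TLI, and RMSEA are simple functions of the model and baseline chi-square statistics, so the cutoffs can be made concrete with a short sketch. The formulas below are the standard definitions (software differs slightly, for example n versus n - 1 in RMSEA), the example values are hypothetical, and SRMR is omitted because it requires the residual correlation matrix rather than chi-square summaries.

```python
import math

def fit_indices(chi2_m: float, df_m: int, chi2_b: float, df_b: int, n: int) -> dict:
    """Approximate CFI, TLI, and RMSEA from model (m) and baseline (b) chi-squares."""
    cfi = 1 - max(chi2_m - df_m, 0) / max(chi2_b - df_b, chi2_m - df_m, 1e-12)
    tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1)
    rmsea = math.sqrt(max(chi2_m - df_m, 0) / (df_m * (n - 1)))  # some software uses n
    return {"CFI": round(cfi, 3), "TLI": round(tli, 3), "RMSEA": round(rmsea, 3)}

# Hypothetical chi-square values, loosely on the scale of the models reported below.
print(fit_indices(chi2_m=1850.0, df_m=937, chi2_b=40000.0, df_b=990, n=357))
```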

Ethics approval

We submitted our study proposal (study #: 23-0958) to our university’s Office of Human Research Ethics, whereby our study was approved and determined to be exempt from further review.

Internal consistency reliability

Assessments of internal consistency reliability for each competency-specific item set yielded evidence of strong reliability. Specifically, Cronbach’s alpha, McDonald’s omega, and Raykov’s rho ranged from 0.82 to 0.96 across competencies. Internal consistency reliability was also strong for the two validation constructs. For ISP gains, Cronbach’s alpha and Raykov’s rho were 0.86; McDonald’s omega was 0.87. For recipient benefits, Cronbach’s alpha and Raykov’s rho were 0.91; McDonald’s omega was 0.92. See Table 2 for a detailed overview of reliability estimates across competencies and validation constructs.

Factorial and construct validity

Co-creation and engagement domain

Figure 1 features the second-order factor model with co-creation and engagement specified as a second-order factor and the five corresponding competencies specified as first-order factors. ISP gains and recipient benefits were also regressed on the co-creation and engagement factor. This model yielded good model fit (χ²[937] = 1857.16, p < .001; CFI = 0.95; TLI = 0.95; SRMR = 0.05; RMSEA = 0.052 [upper-level 90% confidence interval: 0.056]). All first-order standardized factor loadings were statistically significant and valued between 0.66 and 0.87. All standardized second-order factor loadings were statistically significant and valued between 0.89 and 0.93. Per standardized regression coefficients, the co-creation and engagement domain also was significantly and positively associated with ISP gains (β = 0.62, p < .001; R² = 0.38) and recipient benefits (β = 0.66, p < .001; R² = 0.44).

Figure 1

Second-Order Confirmatory Factor Analysis of Domain 1 and Construct Validation (Standardized Parameters). Note: Error terms for observed indicators and full measurement models for the two focal endogenous constructs are omitted to retain visual parsimony. All parameter estimates are standardized. *** p < .001. All first-order and second-order factor loadings are significant at p < .001 level. ISP = Implementation support practitioner

Ongoing improvement domain

Figure 2 features the second-order factor model with ongoing improvement specified as a second-order factor and the six corresponding competencies specified as first-order factors. ISP gains and recipient benefits were also regressed on the ongoing improvement factor. This model yielded good model fit, overall (χ²[1215] = 2707.55, p < .001; CFI = 0.95; TLI = 0.95; SRMR = 0.06; RMSEA = 0.062 [upper-level 90% confidence interval: 0.065]). All first-order standardized factor loadings were statistically significant and valued between 0.68 and 0.96. All second-order standardized factor loadings were statistically significant and valued between 0.80 and 0.95. Per standardized regression coefficients, the ongoing improvement domain also was significantly and positively associated with ISP gains (β = 0.61, p < .001; R² = 0.37) and recipient benefits (β = 0.64, p < .001; R² = 0.41).

Figure 2

Second-Order Confirmatory Factor Analysis of Domain 2 and Construct Validation (Standardized Parameters). Note: Error terms for observed indicators and full measurement models for the two focal endogenous constructs are omitted to retain visual parsimony. All parameter estimates are standardized. *** p < .001. All first-order and second-order factor loadings are significant at p < .001 level. FSA = frameworks, strategies, and approaches; ISP = implementation support practitioner

Sustaining change domain

Figure 3 features the second-order factor model with sustaining change specified as a second-order factor and the four corresponding competencies specified as first-order factors. ISP gains and recipient benefits were also regressed on the sustaining change factor. This model yielded good model fit (χ²[1477] = 2927.16, p < .001; CFI = 0.96; TLI = 0.96; SRMR = 0.05; RMSEA = 0.058 [upper-level 90% confidence interval: 0.061]). All first-order standardized factor loadings were statistically significant and valued between 0.79 and 0.94. All second-order standardized factor loadings were statistically significant and valued between 0.88 and 0.94. Per standardized regression coefficients, the sustaining change domain also was significantly and positively associated with ISP gains (β = 0.69, p < .001; R² = 0.48) and recipient benefits (β = 0.75, p < .001; R² = 0.57).

Figure 3

Second-Order Confirmatory Factor Analysis of Domain 3 and Construct Validation (Standardized Parameters). Note: Error terms for observed indicators and full measurement models for the two focal endogenous constructs are omitted to retain visual parsimony. All parameter estimates are standardized. *** p < .001. All first-order and second-order factor loadings are significant at p < .001 level. ISP = Implementation support practitioner

Across all three models, standardized factor loadings associated with the validation constructs were statistically significant and ranged between 0.72 and 0.95. These details were omitted from figures to preserve visual parsimony. Taken together, results from the three models provided evidence for the factorial and construct validity of all three domains and associated competencies in the ISCA. See Appendices 1, 2, and 3 for summaries of standardized factor loadings, item communalities (i.e., proportion of item variance attributable to its corresponding latent factor), and item response frequencies. A correlation matrix of all study variables is available upon request.

Tests of alternative models

With respect to alternative models, we compared the second-order factor specification for each domain with models in which only first-order factors were specified (and allowed to correlate). We then assessed differences in model fit between the first-order and second-order factor specifications. Leveraging the guidelines provided by Cheung and Rensvold [ 29 ], we specifically assessed differences in CFI values to determine whether alternative models differed significantly. Decreases in CFI values of more than 0.01-units between an original model and alternative model would indicate a significant worsening of model fit. For all three domains, the first-order specification and second-order specification did not differ significantly (i.e., changes in CFI did not exceed 0.003-units in any case; more details about these analyses are available upon request). When alternative models yield statistically negligible differences in model fit, it is good practice to favor the more parsimonious specification (i.e., the model with fewer parameter estimates and more degrees of freedom). Because second-order factor structures are more parsimonious than first-order factor structures (with all possible first-order factor correlations), we retained the second-order factor models as optimal.
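The model-comparison rule described here is easy to state in code. The sketch below encodes the Cheung and Rensvold ΔCFI criterion, with hypothetical CFI values used purely for illustration.

```python
def retain_second_order(cfi_first_order: float, cfi_second_order: float,
                        threshold: float = 0.01) -> bool:
    """Apply the Cheung and Rensvold delta-CFI rule.

    Returns True when the more parsimonious second-order model fits no more than
    `threshold` CFI-units worse than the correlated first-order model.
    """
    return (cfi_first_order - cfi_second_order) <= threshold

# Hypothetical values: a 0.003-unit difference, as in the comparisons described above.
print(retain_second_order(cfi_first_order=0.953, cfi_second_order=0.950))  # True
```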

Response rates and evidence of acceptability

As noted earlier, response rates steadily declined as participants progressed through the ISCA. As reported in Table 2 , the number of responses provided for the items associated with each competency ranged from a high of 357 (the first competency) to a low of 290 (the fifteenth and final competency), decreasing roughly linearly. The average attrition rate from competency to competency was 1.5%. Moreover, we did not observe any anomalous or unexpected levels of data missingness for any particular item.

Open-ended feedback from pilot-test participants also provided evidence of the acceptability of the ISCA. Pilot-test participants described the ISCA as thorough, clear, easy to understand, and applicable to their work. The ISCA also was described as a tool that could support self-reflection and guide professional development efforts. One pilot-test participant even stated that they “really enjoyed” completing the ISCA.

The purpose of the current study was to psychometrically evaluate the ISCA, a promising assessment instrument intended to measure levels of competence across 15 core competencies posited to undergird the effective provision of implementation support in various service delivery settings. Our results offer evidence of the internal consistency reliability and factorial validity of the ISCA, including its three specific domains and associated competencies. The strength of relationships between each domain and the specified validation constructs—ISP gains and recipient benefits—also provide notable evidence of the construct validity of the ISCA. In alignment with the mechanisms of implementation support articulated by Albers and colleagues [ 2 ], our findings suggest that one’s possession of high levels of competence across each of the three competency domains is strongly associated with theorized outcomes that can promote successful and sustainable implementation efforts among those who receive implementation support from an ISP.

It is important to highlight that previous research undergirding the identification and operationalization of the ISCA competencies included an integrated systematic review of strategies used by ISPs and the skills needed to use these strategies in practice, along with survey research and interviews that centered the experiences of professionals providing implementation support in diverse service sectors and geographic regions [ 1 , 2 , 3 , 4 , 5 ]. This previous research on ISPs identified the high level of skill required by those providing implementation support, leading to questions about how to select, recruit, and build the capacity of these professionals.

The ISCA serves as a foundational tool for workforce development to measure and improve the skills that are required to both engage in the relational and complex processes involved in building implementation capacity and to tailor a package of implementation strategies situated in context. As we seek to understand how the strategies and skills used by ISPs bring about change in service systems, the ISCA can be used to answer key questions posed by Albers and colleagues [ 3 ] including (a) how these competencies can be built and maintained in service settings in ways that activate mechanisms for change, (b) how different skills may be needed in different settings and contexts, and (c) how the roles of ISPs can be supported to establish cultures of learning and evidence use.

As we seek to understand how ISPs activate mechanisms of change to support implementation and evidence use, the ISCA can be used to support self-assessments that identify areas of strength and professional development opportunities for growing the skills needed to build implementation capacity. Supervisors can use the ISCA to inform professional development and decisions around recruitment and hiring. Taken together, the ISCA can be used to further define the role of key actors in the field of implementation science who represent the implementation support system [ 30 ].

Future directions and limitations

The ISCA is foundational for future research on the role of implementation support and can be used for evidence-building related to implementation practice. For example, the ISCA can be used to assess whether trainings and other implementation support capacity-building activities promote gains in core competencies. The ISCA also can be used for basic implementation research including assessing the extent to which possession and use of particular competencies is associated with implementation progress across implementation stages [ 31 ] and long-term implementation outcomes in real-world settings [ 32 , 33 ].

As we seek to understand the characteristics of effective implementation teams and champions, the ISCA also can be used to identify “clusters” of competencies that appear to bolster specific support roles in various implementation efforts. Moreover, research leveraging the ISCA might be well positioned to identify the presence of specific competency portfolios possessed by members of implementation teams, highlighting the potential for teams to assemble a group of ISPs who, when brought together, provide coverage of the various competencies that undergird effective implementation support. Research on this front seems promising, as it is unlikely that any single ISP would possess high levels of competence across all 15 competencies reflected in the ISCA.

The current study possesses some limitations that should shape interpretations and conclusions. First, although there was notable diversity in the analytic sample with respect to sociodemographic and professional characteristics, study findings likely generalize best to ISPs who reside in North America and identify as White women. Second, the total analytic sample size was insufficiently large to support multiple group comparison analyses (i.e., tests of measurement invariance), whereby the psychometric properties of the ISCA could be compared across meaningful subgroups (e.g., continent of residence, gender identity, racial/ethnic identity, service type, service setting). Future research should seek to recruit very large samples that would support such analyses, which could highlight whether the ISCA performs equivalently across various respondent characteristics. Third, as a self-assessment tool, the ISCA is potentially subject to the common biases inherent in self-report data. Moreover, study participants provided both their competency assessments and responses to the outcome measures used for validation purposes. Consequently, associations between the ISCA constructs and validation constructs could be inflated due to common method variance. Future research should endeavor to validate the ISCA using outcome measures collected from the recipients of implementation support. Indeed, we view the current study as a launching point for a larger body of work intended to robustly validate the ISCA.

This study brings together several years of theory development and research on the role of ISPs and the competencies that are needed for them to be successful in their role. To date, a psychometrically validated measure of implementation support competencies has not been available. Results from the current study showcase a promising, psychometrically robust assessment tool on this front—the Implementation Support Competencies Assessment (ISCA). As a whole, results from this study also provide compelling evidence of reliability and validity with respect to the implementation support competencies identified by Metz and colleagues. Using the ISCA can shed light on the black box of many current implementation studies that fail to show positive effects of specific implementation strategies on implementation outcomes [ 13 ]. The ISCA enables understanding of the level of competency with which implementation strategies are selected, tailored, and delivered, which may be as important as the specific strategy or package of strategies selected. At the very least, the ISCA can support efforts to understand the impact that competent (or less competent) implementation support has on the outcomes of a particular implementation effort.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Abbreviations

ISP: Implementation Support Practitioner
ISCA: Implementation Support Competencies Assessment
CFA: Confirmatory Factor Analysis
CFI: Comparative Fit Index
TLI: Tucker-Lewis Index
RMSEA: Root Mean Square Error of Approximation
WLSMV: Weighted Least Squares, Means- and Variance-Adjusted
MCAR: Missing Completely at Random

References

1. Albers B, Metz A, Burke K. Implementation support practitioners - A proposal for consolidating a diverse evidence base. BMC Health Serv Res. 2020;20(1):368.
2. Albers B, Metz A, Burke K, Bührmann L, Bartley L, Driessen P, et al. The Mechanisms of Implementation Support - Findings from a Systematic Integrative Review. Res Soc Work Pract. 2022;32(3):259–80.
3. Albers B, Metz A, Burke K, Bührmann L, Bartley L, Driessen P, et al. Implementation Support Skills: Findings From a Systematic Integrative Review. Res Soc Work Pract. 2021;31(2):147–70.
4. Metz A, Albers B, Burke K, Bartley L, Louison L, Ward C, et al. Implementation Practice in Human Service Systems: Understanding the Principles and Competencies of Professionals Who Support Implementation. Hum Serv Organ Manag Leadersh Gov. 2021;45(3):238–59.
5. Bührmann L, Driessen P, Metz A, Burke K, Bartley L, Varsi C, et al. Knowledge and attitudes of Implementation Support Practitioners—Findings from a systematic integrative review. PLoS One. 2022;17(5):e0267533.
6. Moore JE, Rashid S, Park JS, Khan S, Straus SE. Longitudinal evaluation of a course to build core competencies in implementation practice. Implement Sci. 2018;13(1):1–13.
7. Mosson R, Augustsson H, Bäck A, Åhström M, Von Thiele Schwarz U, Richter A, et al. Building implementation capacity (BIC): A longitudinal mixed methods evaluation of a team intervention. BMC Health Serv Res. 2019;19(1):1–12.
8. Park JS, Moore JE, Sayal R, Holmes BJ, Scarrow G, Graham ID, et al. Evaluation of the “Foundations in Knowledge Translation” training initiative: preparing end users to practice KT. Implement Sci. 2018;13(1):63.
9. Aldridge WA II, Roppolo RH, Brown J, Bumbarger BK, Boothroyd RI. Mechanisms of change in external implementation support: A conceptual model and case examples to guide research and practice. Implement Res Pract. 2023. https://doi.org/10.1177/26334895231179761.
10. Westerlund A, Sundberg L, Nilsen P. Implementation of Implementation Science Knowledge: The Research-Practice Gap Paradox. Worldviews Evid Based Nurs. 2019;16(5):332–4.
11. Juckett LA, Bunger AC, McNett MM, Robinson ML, Tucker SJ. Leveraging academic initiatives to advance implementation practice: a scoping review of capacity building interventions. Implement Sci. 2022;17(1):1–14.
12. Jensen TM, Metz AJ, Disbennett ME, Farley AB. Developing a practice-driven research agenda in implementation science: Perspectives from experienced implementation support practitioners. Implement Res Pract. 2023;3:4.
13. Boaz A, Baeza J, Fraser A, Persson E. ‘It depends’: what 86 systematic reviews tell us about what strategies to use to support the use of research in clinical practice. Implement Sci. 2024;19(1):1–30.
14. Mazurek Melnyk B, Gallagher-Ford L, English Long L, Fineout-Overholt E. The establishment of evidence-based practice competencies for practicing registered nurses and advanced practice nurses in real-world clinical settings: Proficiencies to improve healthcare quality, reliability, patient outcomes, and costs. Worldviews Evid Based Nurs. 2014;11(1):5–15.
15. Metz A, Jensen T, Farley A, Boaz A. Is implementation research out of step with implementation practice? Pathways to effective implementation support over the last decade. Implement Res Pract. 2022;3:263348952211055.
16. Metz A, Albers B, Jensen T. Implementation Support Competencies Assessment (ISCA). 2024. https://doi.org/10.17605/OSF.IO/ZH82F.
17. DeVellis R. Scale development: theory and applications. 3rd ed. Thousand Oaks, CA: Sage; 2012.
18. Metz A, Louison L, Burke K, Albers B, Ward C. Implementation support practitioner profile: Guiding principles and core competencies for implementation practice. University of North Carolina at Chapel Hill; 2020.
19. Wilson J, Ward C, Fetvadjiev VH, Bethel A. Measuring Cultural Competencies: The Development and Validation of a Revised Measure of Sociocultural Adaptation. J Cross Cult Psychol. 2017;48(10):1475–506.
20. McNeish D. Thanks coefficient alpha, we’ll take it from here. Psychol Methods. 2018;23(3):412–33.
21. Padilla MA, Divers J. A Comparison of Composite Reliability Estimators: Coefficient Omega Confidence Intervals in the Current Literature. Educ Psychol Meas. 2016;76(3):436–53.
22. Muthén L, Muthén B. Mplus User’s Guide. 8th ed. Los Angeles, CA: Muthén & Muthén; 2017.
23. Flora DB, Curran PJ. An Empirical Evaluation of Alternative Methods of Estimation for Confirmatory Factor Analysis With Ordinal Data. Psychol Methods. 2004;9(4):466–91.
24. Shi D, Lee T, Fairchild AJ, Maydeu-Olivares A. Fitting Ordinal Factor Analysis Models With Missing Data: A Comparison Between Pairwise Deletion and Multiple Imputation. Educ Psychol Meas. 2020;80(1):41–66.
25. Li C. Little’s test of missing completely at random. Stata J. 2013;13(4):795–809.
26. Browne MW, Cudeck R. Alternative Ways of Assessing Model Fit. Sociol Methods Res. 1992;21(2):230–58.
27. Hu LT, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Struct Equ Modeling. 1999;6(1):1–55.
28. MacCallum RC, Browne MW, Sugawara HM. Power analysis and determination of sample size for covariance structure modeling. Psychol Methods. 1996;1(2):130–49.
29. Cheung GW, Rensvold RB. Evaluating Goodness-of-Fit Indexes for Testing Measurement Invariance. Struct Equ Modeling. 2002;9(2):233–55.
30. Wandersman A, Duffy J, Flaspohler P, Noonan R, Lubell K, Stillman L, et al. Bridging the Gap Between Prevention Research and Practice: The Interactive Systems Framework for Dissemination and Implementation. Am J Comm Psychol. 2008;41(3–4):171–81.
31. McGuier EA, Kolko DJ, Stadnick NA, Brookman-Frazee L, Wolk CB, Yuan CT, et al. Advancing research on teams and team effectiveness in implementation science: An application of the Exploration, Preparation, Implementation, Sustainment (EPIS) framework. Implement Res Pract. 2023;1:4.
32. Proctor E, Silmere H, Raghavan R, Hovmand P, Aarons G, Bunger A, et al. Outcomes for implementation research: Conceptual distinctions, measurement challenges, and research agenda. Adm Policy Ment Health. 2011;38(2):65–76.
33. Proctor EK, Bunger AC, Lengnick-Hall R, Gerke DR, Martin JK, Phillips RJ, et al. Ten years of implementation outcomes research: a scoping review. Implement Sci. 2023;18(1):1–19.


Acknowledgements

The authors wish to thank Amanda Farley for her support in preparation of the measure and data collection. The authors also wish to thank Mackensie Disbennett for her support in the process of reviewing and refining measurement items.

No external funding sources supported this study.

Author information

Authors and Affiliations

School of Education, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Todd M. Jensen

School of Social Work, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA

Allison J. Metz

Institute for Implementation Science in Healthcare, University of Zurich, Zurich, Switzerland

Bianca Albers


Contributions

The first author led analysis and drafted the methods and results sections and reviewed and edited all sections. The second author co-led conceptualization of items and drafted the background and discussion sections and reviewed and edited all sections. The third author co-led conceptualization of items and reviewed and edited all sections.

Corresponding author

Correspondence to Todd M. Jensen.

Ethics declarations

Ethics approval and consent to participate

The Institutional Review Board of the primary authors’ university reviewed this study and designated it as exempt (IRB #23-0958).

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Implementation Support Competencies Assessment (ISCA) (Metz, Albers, & Jensen, 2024) – Items for the Co-Creation and Engagement Domain, Standardized Factor Loadings, Item Communalities, and Item Response Frequencies

| # | Competency and Item | FL | Com. | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C1.1 | Obtain clear understanding of the system, organizational context, and culture in which implementation will take place | 0.78 | 0.61 | 6% | 33% | 47% | 14% | 0% |
| C1.2 | Create opportunities for new ideas to emerge | 0.73 | 0.53 | 0% | 9% | 28% | 47% | 16% |
| C1.3 | Build trust and respect for all perspectives involved in supporting implementation | 0.73 | 0.54 | 5% | 21% | 54% | 20% | 0% |
| C1.4 | Communicate and listen so that you can integrate different perspectives and types of knowledge | 0.69 | 0.48 | 2% | 20% | 52% | 27% | 0% |
| C1.5 | Provide interactive and educational trainings on implementation science | 0.66 | 0.44 | 7% | 23% | 28% | 28% | 14% |
| C1.6 | Tailor approaches to enhance implementation readiness at individual, organizational, and system levels | 0.84 | 0.70 | 2% | 16% | 37% | 33% | 13% |
| C2.1 | Identify individuals or groups that should be involved in implementation and seek to understand why they were not yet included | 0.74 | 0.55 | 1% | 14% | 34% | 37% | 14% |
| C2.2 | Connect individuals or groups who have been disconnected in the system by serving as a relational resource | 0.72 | 0.51 | 4% | 17% | 36% | 34% | 10% |
| C2.3 | Develop and regularly convene implementation groups and teams with diverse partners | 0.80 | 0.64 | 2% | 13% | 29% | 41% | 16% |
| C2.4 | Connect people strategically in a variety of ways when there is a potential for mutual benefit | 0.79 | 0.62 | 1% | 12% | 32% | 41% | 15% |
| C2.5 | Support the use of evidence and data with implementation partners to support implementation | 0.77 | 0.59 | 1% | 7% | 26% | 39% | 28% |
| C2.6 | Promote opportunities for implementation partners to engage with each other in the use of evidence and data | 0.83 | 0.69 | 1% | 12% | 31% | 41% | 16% |
| C3.1 | Put the experiences of end users (e.g., service recipients) at the center of decisions about implementation | 0.72 | 0.51 | 1% | 10% | 30% | 41% | 19% |
| C3.2 | Identify how different partners can influence implementation | 0.85 | 0.72 | 0% | 9% | 33% | 42% | 15% |
| C3.3 | Identify existing power structures in the implementation setting | 0.80 | 0.64 | 1% | 15% | 31% | 39% | 14% |
| C3.4 | Use facilitation techniques to honor all voices involved in implementation | 0.77 | 0.59 | 2% | 13% | 25% | 39% | 20% |
| C3.5 | Carefully attend to which partners hold the most and least power to influence implementation | 0.85 | 0.72 | 2% | 16% | 32% | 39% | 11% |
| C3.6 | Seek and gain buy-in from formal and informal leaders (e.g., champions, opinion leaders, or others potentially influencing the implementation because of their reputation or credibility) to include diverse expertise in team discussions | 0.84 | 0.70 | 2% | 13% | 35% | 37% | 13% |
| C3.7 | Support partners in developing an authentic and evolving shared understanding about implementation | 0.87 | 0.76 | 1% | 14% | 35% | 36% | 14% |
| C4.1 | Work with partners to build a strong fit between a selected intervention and its implementation | 0.85 | 0.72 | 3% | 8% | 34% | 39% | 16% |
| C4.2 | Support collaborative implementation planning involving all partners | 0.87 | 0.76 | 1% | 8% | 30% | 43% | 18% |
| C4.3 | To the extent possible, enable implementation partners to co-design any implementation tools, products, processes, governance structures, service models, strategies, and policies | 0.80 | 0.65 | 4% | 15% | 31% | 36% | 14% |
| C4.4 | Promote ongoing testing of implementation tools, products, and processes to improve them | 0.77 | 0.59 | 3% | 15% | 30% | 37% | 17% |
| C4.5 | Support the modification of specific implementation strategies based on local context | 0.85 | 0.71 | 2% | 12% | 27% | 40% | 19% |
| C4.6 | Facilitate activities that prioritize the needs of people who are intended to benefit from the intervention being implemented | 0.83 | 0.70 | 2% | 12% | 30% | 36% | 20% |
| C5.1 | Regularly assess the implementation support needs and assets of different partner groups | 0.84 | 0.71 | 2% | 15% | 41% | 29% | 13% |
| C5.2 | Facilitate agreement on the implementation supports that will be offered to different partner groups | 0.84 | 0.71 | 2% | 18% | 37% | 33% | 10% |
| C5.3 | Develop a plan for meetings and activities (virtual or onsite) based on the goals of implementation partners | 0.80 | 0.64 | 2% | 7% | 23% | 44% | 25% |
| C5.4 | Be responsive to “ad hoc”/“just in time” support needs of implementation partners | 0.76 | 0.58 | 1% | 8% | 25% | 44% | 23% |
| C5.5 | Regularly assess whether your level of support matches the needs, goals, and context of implementation | 0.78 | 0.60 | 1% | 15% | 35% | 36% | 13% |
| C5.6 | Work with partners to tailor implementation strategies to meet local needs and assets | 0.87 | 0.76 | 1% | 10% | 34% | 40% | 15% |
| C5.7 | Continuously promote the adaptability of implementation strategies used by partners | 0.86 | 0.75 | 1% | 14% | 33% | 37% | 16% |

Note: FL = standardized factor loading; Com. = item communality; columns 1–5 report response frequencies for each scale point (1 = not at all competent, 2 = slightly competent, 3 = moderately competent, 4 = very competent, 5 = extremely competent). The full measure is available via the Open Science Framework repository with the following reference: Metz, A., Albers, B., & Jensen, T. (2024). Implementation Support Competencies Assessment (ISCA). https://doi.org/10.17605/OSF.IO/ZH82F
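For readers less familiar with these psychometric quantities, the short sketch below (a minimal Python illustration, not code from the article) shows how the FL and Com. columns relate under a standardized single-factor model, where an item's communality is simply its squared loading, and how a composite reliability such as McDonald's omega could be computed from loadings alone. The function names are hypothetical, and the omega value printed for the C1 items is purely illustrative rather than a result reported by the authors.

```python
# Minimal illustration (not from the article) of how the appendix columns relate.
# Under a standardized single-factor (congeneric) model, an item's communality is
# the square of its standardized loading, and its uniqueness is 1 - loading**2.

def communality(loading: float) -> float:
    """Squared standardized factor loading for one item."""
    return loading ** 2

def omega_from_loadings(loadings: list[float]) -> float:
    """Composite reliability (McDonald's omega) computed from standardized loadings only."""
    total = sum(loadings)
    unique = sum(1 - lam ** 2 for lam in loadings)
    return total ** 2 / (total ** 2 + unique)

# Standardized loadings for items C1.1-C1.6, taken from the table above.
c1_loadings = [0.78, 0.73, 0.73, 0.69, 0.66, 0.84]

print(round(communality(0.78), 2))                 # 0.61, matching Com. for C1.1
print(round(omega_from_loadings(c1_loadings), 2))  # about 0.88 (illustrative only)
```

Any value computed this way is only a rough check against the table, not a substitute for the reliability estimates reported in the article itself.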

Implementation Support Competencies Assessment (ISCA) (Metz, Albers, & Jensen, 2024) – Items for the Ongoing Improvement Domain, Standardized Factor Loadings, Item Communalities, and Item Response Frequencies

| # | Competency and Item | FL | Com. | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C6.1 | Collaborate with partners to identify the needs and assets of different individuals and groups involved in implementation | 0.83 | 0.68 | 1% | 8% | 35% | 41% | 15% |
| C6.2 | Engage people with lived experience to discover needs and assets | 0.68 | 0.47 | 3% | 16% | 28% | 35% | 18% |
| C6.3 | Facilitate the identification of relevant resources to be used in implementation | 0.81 | 0.66 | 1% | 10% | 29% | 44% | 17% |
| C6.4 | Support implementation partners to understand each other’s perspectives on the need for change | 0.88 | 0.77 | 2% | 10% | 36% | 39% | 12% |
| C6.5 | Use a variety of data sources to highlight needs and assets related to implementation | 0.85 | 0.71 | 2% | 11% | 28% | 37% | 22% |
| C6.6 | Use data to explore the unique needs of specific populations (e.g., race, ethnicity, gender, socioeconomic status, geography, ability status) | 0.78 | 0.60 | 3% | 13% | 32% | 32% | 20% |
| C7.1 | Involve diverse partners from throughout the system to identify and understand the implications of implementation | 0.87 | 0.76 | 1% | 15% | 35% | 35% | 14% |
| C7.2 | Review available evidence to determine the relevance and fit of the proposed intervention to be implemented | 0.80 | 0.64 | 1% | 10% | 27% | 41% | 20% |
| C7.3 | Assess the fit of the proposed intervention with the values, needs, and resources of the service setting | 0.89 | 0.79 | 2% | 11% | 32% | 38% | 18% |
| C7.4 | Assess the fit of the proposed intervention with the current political, financial, and organizational contexts | 0.83 | 0.69 | 5% | 14% | 37% | 32% | 12% |
| C7.5 | Continuously identify and respond to changes in the systems which affect implementation | 0.88 | 0.77 | 3% | 13% | 34% | 37% | 13% |
| C7.6 | Identify and support actions that manage risks and assumptions for implementation | 0.88 | 0.78 | 5% | 16% | 40% | 30% | 8% |
| C8.1 | Remain up to date on evidence developed through implementation research and practice | 0.91 | 0.82 | 4% | 15% | 31% | 35% | 16% |
| C8.2 | Remain up to date on knowledge about implementation frameworks, models, theories, and strategies | 0.90 | 0.81 | 3% | 18% | 33% | 32% | 14% |
| C8.3 | Educate partners about the best available evidence on implementation frameworks, strategies, and approaches that could be used to support implementation | 0.86 | 0.74 | 7% | 19% | 32% | 31% | 10% |
| C8.4 | Include all relevant partners in the selection, combination, and co-design of implementation strategies and approaches | 0.93 | 0.86 | 6% | 13% | 37% | 35% | 9% |
| C8.5 | In collaboration with partners, support the use of implementation frameworks, approaches, and strategies that are best suited for the specific service setting | 0.93 | 0.87 | 6% | 13% | 35% | 34% | 12% |
| C9.1 | Ensure that meetings and convenings to support implementation are welcoming and engaging for all participants | 0.75 | 0.57 | 1% | 7% | 20% | 51% | 21% |
| C9.2 | Support relevant partners in identifying barriers to implementation | 0.88 | 0.77 | 1% | 7% | 24% | 48% | 21% |
| C9.3 | Facilitate the identification of partners needed to develop and execute strategies for addressing barriers to implementation | 0.89 | 0.80 | 2% | 8% | 31% | 41% | 18% |
| C9.4 | Serve as a formal and informal facilitator as determined by an analysis of the implementation challenge and context | 0.86 | 0.74 | 4% | 12% | 23% | 41% | 20% |
| C9.5 | Support implementation partners to generate and prioritize ideas to address barriers to implementation | 0.87 | 0.76 | 1% | 9% | 27% | 44% | 19% |
| C9.6 | Support partners to evaluate alternatives, summarize key points, sort ideas, and exercise judgment in the face of simple challenges with easy solutions | 0.88 | 0.77 | 2% | 12% | 29% | 41% | 16% |
| C9.7 | Support partners to generate alternatives, facilitate open discussion, gather different points of view, and delay quick decision-making in the face of complex challenges with no easy solutions | 0.87 | 0.76 | 3% | 14% | 32% | 36% | 15% |
| C9.8 | Use facilitation methods (e.g., action planning, brainstorming, role playing, ranking, scenario development) that match the implementation challenge | 0.80 | 0.63 | 3% | 17% | 25% | 38% | 17% |
| C9.9 | Respond to emergent implementation challenges with flexibility and adaptability | 0.87 | 0.76 | 2% | 10% | 27% | 42% | 19% |
| C10.1 | Work with partners to develop communication protocols that facilitate engagement with each other | 0.95 | 0.90 | 4% | 17% | 32% | 35% | 12% |
| C10.2 | Work with partners to develop communication protocols that communicate and celebrate implementation progress | 0.94 | 0.89 | 4% | 14% | 35% | 35% | 12% |
| C10.3 | Work with partners to develop communication protocols that report barriers hindering implementation | 0.96 | 0.93 | 5% | 19% | 33% | 32% | 12% |
| C10.4 | Work with partners to develop communication protocols that periodically review past decisions to continually assess their appropriateness | 0.92 | 0.84 | 7% | 21% | 35% | 26% | 11% |
| C10.5 | Support the development of tailored communication protocols for different audiences | 0.88 | 0.77 | 3% | 18% | 31% | 34% | 14% |
| C10.6 | Encourage partners to regularly communicate with and gather feedback from individuals inside and outside the implementing system | 0.89 | 0.79 | 3% | 16% | 30% | 35% | 17% |
| C11.1 | Facilitate the identification of relevant quantitative and qualitative data about implementation activities and outcomes | 0.88 | 0.77 | 3% | 12% | 37% | 26% | 22% |
| C11.2 | Support the development of processes and structures for the routine collection, analysis, and interpretation of implementation data | 0.89 | 0.79 | 3% | 17% | 32% | 31% | 18% |
| C11.3 | Ensure that different partners have access to relevant, valid, and reliable data to help guide implementation decision-making | 0.90 | 0.80 | 3% | 17% | 30% | 36% | 15% |
| C11.4 | Encourage the collection and use of data to explore the impact of implementation on different subgroups | 0.88 | 0.77 | 3% | 12% | 31% | 35% | 20% |
| C11.5 | Develop partners’ capacity to continuously use data for implementation decision-making through modeling, instruction, and coaching | 0.90 | 0.81 | 4% | 17% | 31% | 33% | 15% |
| C11.6 | Help create structures that ensure that crucial information about implementation and improvement is circulated among all partners | 0.95 | 0.89 | 5% | 19% | 33% | 30% | 14% |

Implementation Support Competencies Assessment (ISCA) (Metz, Albers, & Jensen, 2024) – Items for the Sustaining Change Domain, Standardized Factor Loadings, Item Communalities, and Item Response Frequencies

| # | Competency and Item | FL | Com. | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C12.1 | Build trust with implementation partners by being transparent and accountable in all actions | 0.87 | 0.75 | 1% | 2% | 19% | 48% | 29% |
| C12.2 | Build relationships with implementation partners from all parts of the implementation setting | 0.87 | 0.76 | 1% | 5% | 25% | 46% | 23% |
| C12.3 | Continuously evaluate the strengths and weaknesses of your relationships with implementation partners | 0.86 | 0.74 | 2% | 14% | 31% | 39% | 14% |
| C12.4 | Seek and incorporate feedback from implementation partners about the strengths and weaknesses of your relationships with them | 0.80 | 0.64 | 3% | 21% | 26% | 36% | 15% |
| C12.5 | Facilitate open communication that enables difficult conversations with implementation partners, when needed, to regulate distress in your relationships with them | 0.81 | 0.66 | 3% | 12% | 33% | 36% | 16% |
| C12.6 | Demonstrate your competency to implementation partners | 0.87 | 0.76 | 2% | 9% | 32% | 43% | 15% |
| C12.7 | Enter the implementation setting with humility as a learner | 0.73 | 0.53 | 5% | 17% | 46% | 32% | 0% |
| C12.8 | Demonstrate commitment and persistence in the face of complex challenges | 0.86 | 0.74 | 0% | 4% | 17% | 47% | 32% |
| C12.9 | Encourage and enable implementation partners to share their perspectives openly and honestly | 0.87 | 0.75 | 1% | 4% | 17% | 51% | 27% |
| C12.10 | Normalize implementation challenges; ask questions; ask for support from implementation partners | 0.89 | 0.78 | 1% | 6% | 19% | 43% | 32% |
| C12.11 | Support implementation partners to understand each other’s perspective; highlight areas of shared understanding and common goals | 0.90 | 0.81 | 1% | 10% | 23% | 45% | 21% |
| C13.1 | Guide efforts to assemble implementation teams | 0.83 | 0.70 | 2% | 11% | 36% | 37% | 15% |
| C13.2 | Facilitate the development of clear governance structures for implementation teams | 0.84 | 0.71 | 7% | 21% | 36% | 27% | 10% |
| C13.3 | Support teams to select, operationalize, tailor, and adapt interventions | 0.88 | 0.77 | 2% | 14% | 35% | 37% | 12% |
| C13.4 | Support teams to develop operational processes and resources for building staff competency | 0.87 | 0.76 | 3% | 16% | 37% | 32% | 13% |
| C13.5 | Support teams to identify, collect, analyze, and monitor meaningful data | 0.79 | 0.62 | 1% | 14% | 30% | 36% | 18% |
| C13.6 | Support teams to engage leadership, staff, and partners in using data for improvement | 0.85 | 0.72 | 2% | 14% | 28% | 38% | 18% |
| C13.7 | Support teams to build capacity for sustained implementation | 0.90 | 0.80 | 1% | 14% | 34% | 38% | 13% |
| C13.8 | Support teams to build cross-sector collaborations that are aligned with new ways of work | 0.80 | 0.64 | 3% | 21% | 34% | 31% | 11% |
| C13.9 | Support teams to develop effective team meeting processes, including the establishment of consistent meeting schedules and standing agendas | 0.78 | 0.61 | 1% | 12% | 31% | 37% | 19% |
| C13.10 | Ensure implementation teams have sufficient support from organizational leadership to promote successful implementation | 0.82 | 0.67 | 2% | 18% | 40% | 31% | 10% |
| C13.11 | Help to develop communication protocols that ensure relevant information about implementation is circulated among implementation teams and their members | 0.82 | 0.67 | 2% | 18% | 34% | 36% | 10% |
| C13.12 | Develop processes for ongoing assessment and improvement of implementation team functioning | 0.84 | 0.70 | 4% | 21% | 34% | 32% | 9% |
| C13.13 | Support implementation teams in providing opportunities for learning and professional development to its members | 0.85 | 0.73 | 2% | 17% | 32% | 36% | 13% |
| C13.14 | Work to enhance cohesion and trust among implementation team members | 0.84 | 0.71 | 1% | 12% | 31% | 40% | 16% |
| C13.15 | Help manage and resolve conflict among implementation team members | 0.81 | 0.65 | 4% | 19% | 35% | 34% | 9% |
| C14.1 | At the outset of implementation, model with implementation partners the changes that will be implemented | 0.86 | 0.74 | 5% | 19% | 33% | 34% | 10% |
| C14.2 | Work with implementation partners to assess capacity for sustained implementation, including budget considerations | 0.83 | 0.69 | 4% | 23% | 38% | 25% | 10% |
| C14.3 | Facilitate implementation partners’ access to capacity-building training, modeling, or coaching for implementation | 0.84 | 0.70 | 3% | 16% | 35% | 32% | 13% |
| C14.4 | Model with implementation partners relevant knowledge, skills, behaviors, and practices | 0.91 | 0.82 | 4% | 11% | 29% | 41% | 15% |
| C14.5 | Coach implementation partners in their use of relevant knowledge, skills, behaviors, and practices | 0.89 | 0.80 | 5% | 14% | 32% | 34% | 16% |
| C14.6 | Help identify and shape organizational processes needed to build capacity for implementation | 0.86 | 0.74 | 3% | 19% | 34% | 32% | 13% |
| C14.7 | Support implementation partners in identifying and addressing future challenges or barriers to sustained implementation | 0.87 | 0.76 | 3% | 13% | 36% | 35% | 13% |
| C14.8 | Promote collaboration and new partnerships that will build the capacity of implementation partners | 0.84 | 0.70 | 3% | 11% | 38% | 34% | 14% |
| C15.1 | Identify existing leaders who can support implementation | 0.85 | 0.72 | 1% | 11% | 27% | 43% | 18% |
| C15.2 | Help build the capacity of leaders to lead implementation | 0.92 | 0.85 | 4% | 17% | 30% | 36% | 14% |
| C15.3 | Support partners in developing processes for regular coordination meetings with leaders related to implementation | 0.92 | 0.84 | 3% | 13% | 32% | 38% | 13% |
| C15.4 | Help identify and involve emerging leaders who can support implementation | 0.88 | 0.78 | 3% | 16% | 31% | 34% | 16% |
| C15.5 | Build the capacity of emerging leaders to support implementation | 0.92 | 0.84 | 4% | 16% | 32% | 34% | 14% |
| C15.6 | Support implementation partners in navigating any transitions in organizational leadership | 0.85 | 0.72 | 7% | 23% | 32% | 30% | 7% |
| C15.7 | Support implementation partners in identifying champions who can support implementation (champions are professionals or lay persons who volunteer or are appointed to enthusiastically promote and support the implementation of an innovation) | 0.91 | 0.83 | 3% | 15% | 29% | 39% | 15% |
| C15.8 | Support implementation partners in involving champions throughout the course of implementation | 0.94 | 0.89 | 3% | 15% | 28% | 41% | 13% |
| C15.9 | Support implementation partners in reviewing and strengthening champion roles | 0.93 | 0.86 | 6% | 17% | 29% | 38% | 10% |

Note: FL = standardized factor loading; Com. = item communality; columns 1–5 report response frequencies for each scale point (1 = not at all competent, 2 = slightly competent, 3 = moderately competent, 4 = very competent, 5 = extremely competent). The full measure is available via the Open Science Framework repository with the following reference: Metz, A., Albers, B., & Jensen, T. (2024). Implementation Support Competencies Assessment (ISCA). https://doi.org/10.17605/OSF.IO/ZH82F
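As a small worked example of how the response-frequency columns can be summarized, the hypothetical helper below (Python, not part of the published measure or its analyses) converts a row of percentages into a mean self-rating on the 1–5 competency scale; dividing by the sum of the weights absorbs rounding in the published percentages.

```python
# Illustrative sketch (not from the article): summarizing one row of the
# appendix tables as a weighted mean self-rating on the 1-5 scale.

def mean_rating(frequencies_pct: list[float]) -> float:
    """Weighted mean of scale points 1-5 given response percentages."""
    weights = [pct / 100 for pct in frequencies_pct]
    return sum(point * w for point, w in zip(range(1, 6), weights)) / sum(weights)

# Response frequencies for item C12.1 in the table above: 1%, 2%, 19%, 48%, 29%.
print(round(mean_rating([1, 2, 19, 48, 29]), 2))  # about 4.03 on the 1-5 scale
```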

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Jensen, T.M., Metz, A.J. & Albers, B. Development and psychometric evaluation of the Implementation Support Competencies Assessment. Implementation Sci 19, 58 (2024). https://doi.org/10.1186/s13012-024-01390-8


Received: 11 April 2024

Accepted: 01 August 2024

Published: 06 August 2024

DOI: https://doi.org/10.1186/s13012-024-01390-8


Implementation Science

ISSN: 1748-5908



New therapy for glioma receives FDA approval

[Image: headshots of Drs. Bigner, Yan, and Peters]

Duke brain tumor researchers were part of the earliest collaborations that led to the development of the drug, which was shown to more than double progression-free survival.

The FDA has approved a new targeted drug for brain tumors known as low-grade gliomas. The drug, vorasidenib, was shown in clinical trials to delay progression of low-grade gliomas carrying mutations in the IDH1 or IDH2 genes.

“Although there have been other targeted therapies for the treatment of brain tumors with the IDH mutation, [this one] has been one of the most successful in survival prolongation of brain tumor patients,” said Darell Bigner, MD, PhD, the E. L. and Lucille F. Jones Cancer Distinguished Research Professor and founding director of the Preston Robert Tisch Brain Tumor Center at Duke.

In clinical trials, progression-free survival was estimated to be 27.7 months for people in the vorasidenib group versus 11.1 months for those in the placebo group.

Bigner, Katherine Peters, MD, PhD, professor of neurology and neurosurgery, and others at the Duke Brain Tumor Center played pivotal roles in the development and approval of the drug: Bigner in the early collaborations with Johns Hopkins University that led to the discovery of the IDH mutation, and Peters, more recently, as lead investigator in the clinical trials.

Patents developed from the early collaborations were licensed to industry through the Duke University Office for Translation & Commercialization, making this the seventh drug currently on the market with Duke intellectual property roots.

Drs. Bigner and Peters as well as Hai Yan, MD, PhD (formerly the Henry S. Friedman Distinguished Professor of Neuro-Oncology at Duke) answered questions about the work that led to vorasidenib.

How did the discovery of the IDH gene mutation contribute to the overall understanding of brain cancer?

Bigner: The discovery of the mutant IDH gene is one of the most important discoveries in neuro-oncology. The IDH mutation has been incorporated by the World Health Organization into the rapid and accurate diagnosis and classification of astrocytic, oligodendroglial, and glioblastoma multiforme brain tumors. Never before has there been a single gene mutation that contributed so greatly to classification. Most importantly, it was immediately recognized that the IDH mutation could be targeted with drugs to treat the group of patients that had malignant brain tumors that expressed the IDH mutation.

Describe the collaboration between Johns Hopkins and Duke that led up to the development of this drug.

Bigner: Perhaps the most important collaboration between Johns Hopkins and Duke came in the work that led to the discovery of the IDH mutation. The National Cancer Institute had established a program in which genome sequencing of all the major cancers was to be done, and it decided that glioblastoma would be the first cancer investigated. The NCI did not perform complete genome sequencing, but the Johns Hopkins and Duke group decided to do so for glioblastoma. Using the Duke material, the Johns Hopkins group sequenced as much of the genome as was possible at that time, in 2008; sequencing then was very laborious compared with the automated methods available now. By performing nearly complete genome sequencing, the Johns Hopkins and Duke group discovered the IDH mutation. The collaboration with Johns Hopkins was strengthened when [we] recruited Dr. Hai Yan [to Duke] in 2003. Dr. Yan had just completed a 5-year period as a post-doctoral research fellow at Johns Hopkins with Dr. Bert Vogelstein in Cancer Molecular Genetics.

Yan: Subsequent research from these teams produced numerous publications that further elucidated the pathological roles of IDH mutations, leading to the reclassification of gliomas in the WHO CNS classification. This body of work ultimately paved the way for the development of targeted therapies, culminating in the approval of [vorasidenib]. This collaboration … exemplifies the power of interdisciplinary and inter-institutional cooperation in driving scientific discovery and innovation in cancer treatment.

Can you briefly explain the mechanism of action of vorasidenib?

Yan: Mutations in the IDH1 or IDH2 genes result in elevated levels of the oncometabolite D-2HG, disrupting normal cellular functions and contributing to tumorigenesis. Vorasidenib selectively binds to the mutated IDH1 and IDH2 enzymes, inhibiting their activity and thereby reducing the production of D-2HG. This inhibition helps to restore normal cellular processes, reduce tumor cell proliferation, and promote the differentiation of cancer cells.

How does the development and approval of vorasidenib affect the broader future of cancer research and treatment? Are there plans to study vorasidenib in combination with other treatments or in different types of brain cancers?

Yan: The development and approval of vorasidenib represent a significant milestone in the field of oncology, particularly in the treatment of brain cancers. It validates the approach of targeting specific genetic mutations with precision therapies and reinforces the importance of personalized medicine in oncology. This success is likely to inspire further research into targeting other genetic mutations and metabolic pathways in various cancers.

Bigner: There are indeed plans to explore the potential of vorasidenib beyond its current indications. Researchers are investigating its use in combination with other therapies, such as immune checkpoint inhibitors, to enhance therapeutic efficacy. Additionally, studies are being planned or are already under way to assess the effectiveness of vorasidenib in treating other types of brain cancers, as well as solid tumors and leukemias with IDH mutations. The ongoing research aims to expand the therapeutic applications of vorasidenib and optimize its use in various clinical settings, potentially benefiting a broader spectrum of cancer patients.

What the Clinical Research Showed 

What were the outcomes of the clinical trial for vorasidenib?

Peters: The INDIGO clinical trial was a phase 3 trial of vorasidenib, an oral inhibitor of mutant IDH1/2 that can readily cross the blood-brain barrier, versus placebo in patients with mutant IDH1/2 glioma. Treatment with vorasidenib significantly improved progression-free survival (27.7 months with vorasidenib vs. 11.1 months with placebo).

The key secondary endpoint was time to next intervention, meaning the time until a patient needed chemotherapy, radiation therapy, or more surgery. For patients receiving placebo, the median time to next intervention was 17.8 months, but for patients receiving vorasidenib, the median time to next intervention has not yet been reached. Thus, patients on vorasidenib could significantly delay chemotherapy, radiation therapy, or more surgery. Most importantly, vorasidenib was well tolerated, with only 3.6% of patients needing to stop the drug because of an adverse event.

What about quality of life for patients on the trial?

Peters: Results showed that patients with IDH-mutant low-grade glioma had a good quality of life, and it was preserved throughout the study. Patients on vorasidenib maintained their cognitive abilities and did not have any decline in their quality of life or cognition.

How might this drug influence future research and development in neuro-oncology?

Peters: At Duke, we are conducting studies of vorasidenib in patients with high-grade tumors and enhancing disease. Most of these studies look at combining vorasidenib with immunotherapy. It will be exciting to see what will happen with the INDIGO study's long-term outcomes.

What does the approval of vorasidenib mean for the treatment landscape of low-grade gliomas?

Peters: It is exciting to have a drug specifically targeted to these patients through inhibition of the mutant IDH enzyme. Because vorasidenib is orally available, well tolerated, and does not impair quality of life or cognition, we can extend people’s lives and delay the use of treatments such as radiation therapy and chemotherapy. I am so thankful to all the patients who participated in this groundbreaking study and paved the way for future patients.

