Statistical Methods for Data Analysis: A Comprehensive Guide

In today’s data-driven world, understanding statistical methods for data analysis is like having a superpower.

Whether you’re a student, a professional, or just a curious mind, diving into the realm of data can unlock insights and decisions that propel success.

Statistical methods for data analysis are the tools and techniques used to collect, analyze, interpret, and present data in a meaningful way.

From businesses optimizing operations to researchers uncovering new discoveries, these methods are foundational to making informed decisions based on data.

In this blog post, we’ll embark on a journey through the fascinating world of statistical analysis, exploring its key concepts, methodologies, and applications.

Introduction to Statistical Methods

At its core, statistical methods are the backbone of data analysis, helping us make sense of numbers and patterns in the world around us.

Whether you’re looking at sales figures, medical research, or even your fitness tracker’s data, statistical methods are what turn raw data into useful insights.

But before we dive into complex formulas and tests, let’s start with the basics.

Data comes in two main types: qualitative and quantitative.

Qualitative vs Quantitative Data - a simple infographic

Quantitative data is all about numbers and quantities (like your height or the number of steps you walked today), while qualitative data deals with categories and qualities (like your favorite color or the breed of your dog).

And when we talk about measuring these data points, we use different scales: nominal, ordinal, interval, and ratio.

These scales help us understand the nature of our data—whether we’re ranking it (ordinal), simply categorizing it (nominal), or measuring it with a true zero point (ratio).

Scales of Data Measurement - an infographic

In a nutshell, statistical methods start with understanding the type and scale of your data.

This foundational knowledge sets the stage for everything from summarizing your data to making complex predictions.

Descriptive Statistics: Simplifying Data

What is Descriptive Statistics - an infographic

Imagine you’re at a party and you meet a bunch of new people.

When you go home, your roommate asks, “So, what were they like?” You could describe each person in detail, but instead, you give a summary: “Most were college students, around 20-25 years old, pretty fun crowd!”

That’s essentially what descriptive statistics does for data.

It summarizes and describes the main features of a collection of data in an easy-to-understand way. Let’s break this down further.

The Basics: Mean, Median, and Mode

  • Mean is just a fancy term for the average. If you add up everyone’s age at the party and divide by the number of people, you’ve got your mean age.
  • Median is the middle number in a sorted list. If you line up everyone from the youngest to the oldest and pick the person in the middle, their age is your median. This is super handy when someone’s age is way off the chart (like if your grandma crashed the party), as it doesn’t skew the data.
  • Mode is the most common age at the party. If you notice a lot of people are 22, then 22 is your mode. It’s like the age that wins the popularity contest.

Spreading the News: Range, Variance, and Standard Deviation

  • Range gives you an idea of how spread out the ages are. It’s the difference between the oldest and the youngest. A small range means everyone’s around the same age, while a big range means a wider variety.
  • Variance is a bit more complex. It measures how much the ages differ from the average age. A higher variance means ages are more spread out.
  • Standard Deviation is the square root of variance. It’s like variance but back on a scale that makes sense. It tells you, on average, how far each person’s age is from the mean age.
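
If you want to check these numbers yourself, Python's built-in statistics module covers all of them. Here's a minimal sketch using a made-up list of party guests' ages (the numbers, including grandma's 68, are purely illustrative):

```python
import statistics

# Hypothetical ages of party guests (68 is grandma crashing the party)
ages = [22, 23, 22, 25, 21, 22, 24, 68]

mean_age = statistics.mean(ages)        # the average
median_age = statistics.median(ages)    # the middle value, barely moved by grandma
mode_age = statistics.mode(ages)        # the most common age
age_range = max(ages) - min(ages)       # oldest minus youngest
variance = statistics.variance(ages)    # sample variance (spread around the mean)
std_dev = statistics.stdev(ages)        # square root of the variance

print(mean_age, median_age, mode_age, age_range, variance, std_dev)
```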

Picture Perfect: Graphical Representations

  • Histograms are like bar charts showing how many people fall into different age groups. They give you a quick glance at how ages are distributed.
  • Bar Charts are great for comparing different categories, like how many men vs. women were at the party.
  • Box Plots (or box-and-whisker plots) show you the median, the range, and if there are any outliers (like grandma).
  • Scatter Plots are used when you want to see if there’s a relationship between two things, like if bringing more snacks means people stay longer at the party.

Why Descriptive Statistics Matter

Descriptive statistics are your first step in data analysis.

They help you understand your data at a glance and prepare you for deeper analysis.

Without them, you’re like someone trying to guess what a party was like without any context.

Whether you’re looking at survey responses, test scores, or party attendees, descriptive statistics give you the tools to summarize and describe your data in a way that’s easy to grasp.

This approach is crucial in educational settings, particularly for enhancing math learning outcomes. For those looking to deepen their understanding of math or seeking additional support, check out this link: https://www.mathnasium.com/math-tutors-near-me.

Remember, the goal of descriptive statistics is to simplify the complex.

Inferential Statistics: Beyond the Basics


Let’s keep the party analogy rolling, but this time, imagine you couldn’t attend the party yourself.

You’re curious if the party was as fun as everyone said it would be.

Instead of asking every single attendee, you decide to ask a few friends who went.

Based on their experiences, you try to infer what the entire party was like.

This is essentially what inferential statistics does with data.

It allows you to make predictions or draw conclusions about a larger group (the population) based on a smaller group (a sample). Let’s dive into how this works.

Probability

Inferential statistics is all about playing the odds.

When you make an inference, you’re saying, “Based on my sample, there’s a certain probability that my conclusion about the whole population is correct.”

It’s like betting on whether the party was fun, based on a few friends’ opinions.

The Central Limit Theorem (CLT)

The Central Limit Theorem is the superhero of statistics.

It tells us that if you take sufficiently large samples from a population, the distribution of the sample means (averages) will be approximately normal (a bell curve), no matter what the population distribution looks like.

This is crucial because it allows us to use sample data to make inferences about the population mean with a known level of uncertainty.
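
You don't have to take the CLT on faith; a short simulation shows it in action. The sketch below uses NumPy (not mentioned elsewhere in this post, so treat it as just one convenient choice) with arbitrary, made-up parameters: it draws repeated samples from a deliberately skewed population and shows that the sample means still cluster tightly around the population mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# A deliberately skewed, non-normal "population" (exponential distribution)
population = rng.exponential(scale=10, size=100_000)

# Take 2,000 samples of size 50 and record each sample's mean
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print("population mean:       ", population.mean())
print("mean of sample means:  ", np.mean(sample_means))   # close to the population mean
print("spread of sample means:", np.std(sample_means))    # roughly population std / sqrt(50)
```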

Confidence Intervals

Imagine you’re pretty sure the party was fun, but you want to know how fun.

A confidence interval gives you a range of values within which you believe the true mean fun level of the party lies.

It’s like saying, “I’m 95% confident the party’s fun rating was between 7 and 9 out of 10.”
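
If you're curious how such a range gets computed, here's a rough sketch in Python using SciPy; the fun ratings are invented for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical fun ratings (out of 10) from the friends who went
fun_ratings = np.array([8, 7, 9, 8, 7, 9, 8])

mean = fun_ratings.mean()
sem = stats.sem(fun_ratings)  # standard error of the mean

# 95% confidence interval for the true mean fun level (t distribution, small sample)
low, high = stats.t.interval(0.95, df=len(fun_ratings) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean fun level: {low:.2f} to {high:.2f}")
```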

Hypothesis Testing

This is where you get to be a bit of a detective. You start with a hypothesis (a guess) about the population.

For example, your null hypothesis might be “the party was average fun.” Then you use your sample data to test this hypothesis.

If the data strongly suggests otherwise, you might reject the null hypothesis and accept the alternative hypothesis, which could be “the party was super fun.”

The p-value tells you how likely it is that you would see data at least as extreme as yours by random chance if the null hypothesis were true.

A low p-value (typically less than 0.05) indicates that your findings are statistically significant—that is, unlikely to have happened by chance alone if the null hypothesis were true.

It’s like saying, “The chance that all my friends are exaggerating about the party being fun is really low, so the party probably was fun.”
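
Here's a small, hypothetical illustration of that logic with SciPy: we test the null hypothesis that the party's true mean fun rating is 5 ("average fun") against a handful of made-up ratings.

```python
import numpy as np
from scipy import stats

# The same hypothetical ratings from a few friends who went
fun_ratings = np.array([8, 7, 9, 8, 7, 9, 8])

# Null hypothesis: the party's true mean fun level is 5 ("average fun")
t_stat, p_value = stats.ttest_1samp(fun_ratings, popmean=5)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the party was probably more fun than average.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```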

Why Inferential Statistics Matter

Inferential statistics let us go beyond just describing our data.

They allow us to make educated guesses about a larger population based on a sample.

This is incredibly useful in almost every field—science, business, public health, and yes, even planning your next party.

By using probability, the Central Limit Theorem, confidence intervals, hypothesis testing, and p-values, we can make informed decisions without needing to ask every single person in the population.

It saves time, resources, and helps us understand the world more scientifically.

Remember, while inferential statistics gives us powerful tools for making predictions, those predictions come with a level of uncertainty.

Being a good data scientist means understanding and communicating that uncertainty clearly.

So next time you hear about a party you missed, use inferential statistics to figure out just how much FOMO (fear of missing out) you should really feel!

Common Statistical Tests: Choosing Your Data’s Best Friend


Alright, now that we’ve covered the basics of descriptive and inferential statistics, it’s time to talk about how we actually apply these concepts to make sense of data.

It’s like deciding on the best way to find out who was the life of the party.

You have several tools (tests) at your disposal, and choosing the right one depends on what you’re trying to find out and the type of data you have.

Let’s explore some of the most common statistical tests and when to use them.

T-Tests: Comparing Averages

Imagine you want to know if the average fun level was higher at this year’s party compared to last year’s.

A t-test helps you compare the means (averages) of two groups to see if they’re statistically different.

There are a couple of flavors:

  • Independent t-test : Use this when comparing two different groups, like this year’s party vs. last year’s party.
  • Paired t-test : Use this when comparing the same group at two different times or under two different conditions, like if you measured everyone’s fun level before and after the party.
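
Both flavors are a single call in SciPy. A minimal sketch with invented fun ratings (this assumes the ratings are roughly normally distributed in each group):

```python
from scipy import stats

this_year = [8, 9, 7, 8, 9, 8, 7, 9]   # invented fun ratings, this year's party
last_year = [6, 7, 7, 5, 8, 6, 7, 6]   # invented fun ratings, last year's party

# Independent t-test: two different groups of guests
t_ind, p_ind = stats.ttest_ind(this_year, last_year)

before = [4, 5, 3, 6, 5, 4]            # the same guests before the party
after = [8, 9, 7, 9, 8, 8]             # ...and after the party

# Paired t-test: the same people measured twice
t_rel, p_rel = stats.ttest_rel(before, after)

print(f"independent: t = {t_ind:.2f}, p = {p_ind:.4f}")
print(f"paired:      t = {t_rel:.2f}, p = {p_rel:.4f}")
```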

ANOVA: When Three’s Not a Crowd

But what if you had three or more parties to compare? That’s where ANOVA (Analysis of Variance) comes in handy.

It lets you compare the means across multiple groups at once to see if at least one of them is significantly different.

It’s like comparing the fun levels across several years’ parties to see if one year stood out.

Chi-Square Test: Categorically Speaking

Now, let’s say you’re interested in whether the type of music (pop, rock, electronic) affects party attendance.

Since you’re dealing with categories (types of music) and counts (number of attendees), you’ll use the Chi-Square test.

It’s great for seeing if there’s a relationship between two categorical variables.
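
In practice you would build a contingency table of counts and hand it to SciPy. A hypothetical sketch:

```python
import numpy as np
from scipy import stats

# Contingency table of guest counts (all numbers invented):
# rows = music type (pop, rock, electronic), columns = stayed late vs. left early
observed = np.array([
    [40, 10],   # pop
    [30, 20],   # rock
    [20, 30],   # electronic
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}, degrees of freedom = {dof}")
```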

Correlation and Regression: Finding Relationships

What if you suspect that the amount of snacks available at the party affects how long guests stay? To explore this, you’d use:

  • Correlation analysis to see if there’s a relationship between two continuous variables (like snacks and party duration). It tells you how closely related two things are.
  • Regression analysis goes a step further by not only showing if there’s a relationship but also how one variable predicts the other. It’s like saying, “For every extra bag of chips, guests stay an average of 10 minutes longer.”
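
Both analyses are available in scipy.stats. Here's a rough sketch with made-up snack and duration numbers:

```python
from scipy import stats

snack_bags = [1, 2, 3, 4, 5, 6, 7, 8]                   # bags of chips at each party
stay_minutes = [95, 110, 115, 130, 140, 150, 160, 175]  # average stay per party

# Correlation: how closely are the two variables related?
r, p_corr = stats.pearsonr(snack_bags, stay_minutes)

# Simple linear regression: predict stay time from the number of snack bags
fit = stats.linregress(snack_bags, stay_minutes)

print(f"correlation r = {r:.2f} (p = {p_corr:.4f})")
print(f"each extra bag of chips is associated with about {fit.slope:.1f} more minutes")
```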

Non-parametric Tests: When Assumptions Don’t Hold

Most of the tests mentioned above assume your data follows a normal distribution and meets other criteria.

But what if your data doesn’t play by these rules?

Enter non-parametric tests, like the Mann-Whitney U test (for comparing two groups when you can’t use a t-test) or the Kruskal-Wallis test (like ANOVA but for non-normal distributions).
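
Both of these are one-liners in scipy.stats. A quick sketch with invented, skewed karaoke scores:

```python
from scipy import stats

# Invented, clearly skewed karaoke scores from three parties
party_a = [2, 3, 3, 4, 5, 30]
party_b = [1, 2, 2, 3, 3, 4]
party_c = [5, 6, 7, 8, 9, 40]

u_stat, p_u = stats.mannwhitneyu(party_a, party_b)      # compare two groups
h_stat, p_h = stats.kruskal(party_a, party_b, party_c)  # compare three or more groups

print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.4f}")
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_h:.4f}")
```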

Picking the Right Test

Choosing the right statistical test is crucial and depends on:

  • The type of data you have (categorical vs. continuous).
  • Whether you’re comparing groups or looking for relationships.
  • The distribution of your data (normal vs. non-normal).

Why These Tests Matter

Just like you’d pick the right tool for a job, selecting the appropriate statistical test helps you make valid and reliable conclusions about your data.

Whether you’re trying to prove a point, make a decision, or just understand the world a bit better, these tests are your gateway to insights.

By mastering these tests, you become a detective in the world of data, ready to uncover the truth behind the numbers!

Regression Analysis: Predicting the Future


Ever wondered if you could predict how much fun you’re going to have at a party based on the number of friends going, or how the amount of snacks available might affect the overall party vibe?

That’s where regression analysis comes into play, acting like a crystal ball for your data.

What is Regression Analysis?

Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest.

Think of it as detective work, where you’re trying to figure out if, how, and to what extent certain factors (like snacks and music volume) predict an outcome (like the fun level at a party).

The Two Main Characters: Independent and Dependent Variables

  • Independent Variable(s): These are the predictors or factors that you suspect might influence the outcome. For example, the quantity of snacks.
  • Dependent Variable: This is the outcome you’re interested in predicting. In our case, it could be the fun level of the party.

Linear Regression: The Straight Line Relationship

The most basic form of regression analysis is linear regression .

It predicts the outcome based on a linear relationship between the independent and dependent variables.

If you plot this on a graph, you’d ideally see a straight line where, as the amount of snacks increases, so does the fun level (hopefully!).

  • Simple Linear Regression involves just one independent variable. It’s like saying, “Let’s see if just the number of snacks can predict the fun level.”
  • Multiple Linear Regression takes it up a notch by including more than one independent variable. Now, you’re looking at whether the quantity of snacks, type of music, and number of guests together can predict the fun level.
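
As a sketch of what multiple linear regression can look like in code, here is one possible approach using the statsmodels library (not mentioned above, so treat it as just one option among several); the party records are invented:

```python
import pandas as pd
import statsmodels.api as sm

# Invented records from past parties
parties = pd.DataFrame({
    "snacks": [3, 5, 2, 8, 6, 4, 7, 9],          # bags of snacks
    "guests": [10, 15, 8, 25, 18, 12, 22, 30],   # number of guests
    "fun":    [5, 6, 4, 9, 7, 6, 8, 9],          # fun rating out of 10
})

X = sm.add_constant(parties[["snacks", "guests"]])  # predictors plus an intercept
y = parties["fun"]

model = sm.OLS(y, X).fit()
print(model.params)     # coefficients: expected change in fun per one-unit change
print(model.rsquared)   # share of the variation in fun explained by the model
```

The two printed outputs correspond directly to the coefficients and R-squared described under "Making Sense of the Results" below.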

Logistic Regression: When Outcomes are Either/Or

Not all predictions are about numbers.

Sometimes, you just want to know if something will happen or not—will the party be a hit or a flop?

Logistic regression is used for these binary outcomes.

Instead of predicting a precise fun level, it predicts the probability of the party being a hit based on the same predictors (snacks, music, guests).
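
Here's a hypothetical sketch of a logistic regression using scikit-learn (one of several libraries that could do this); the party data and the "hit" labels are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented past parties: columns are [snack bags, number of guests]
X = np.array([[3, 10], [5, 15], [2, 8], [8, 25], [6, 18], [4, 12], [7, 22], [9, 30]])
y = np.array([0, 1, 0, 1, 1, 0, 0, 1])   # 1 = the party was a hit, 0 = a flop

model = LogisticRegression().fit(X, y)

# Predicted probability that a party with 6 snack bags and 20 guests will be a hit
print(model.predict_proba([[6, 20]])[0, 1])
```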

Making Sense of the Results

  • Coefficients: In regression analysis, each predictor has a coefficient, telling you how much the dependent variable is expected to change when that predictor changes by one unit, all else being equal.
  • R-squared : This value tells you how much of the variation in your dependent variable can be explained by the independent variables. A higher R-squared means a better fit between your model and the data.

Why Regression Analysis Rocks

Regression analysis is like having a superpower. It helps you understand which factors matter most, which can be ignored, and how different factors come together to influence the outcome.

This insight is invaluable whether you’re planning a party, running a business, or conducting scientific research.

Bringing It All Together

Imagine you’ve gathered data on several parties, including the number of guests, type of music, and amount of snacks, along with a fun level rating for each.

By running a regression analysis, you can start to predict future parties’ success, tailoring your planning to maximize fun.

It’s a practical tool for making informed decisions based on past data, helping you throw legendary parties, optimize business strategies, or understand complex relationships in your research.

In essence, regression analysis helps turn your data into actionable insights, guiding you towards smarter decisions and better predictions.

So next time you’re knee-deep in data, remember: regression analysis might just be the key to unlocking its secrets.

Non-parametric Methods: Playing By Different Rules

So far, we’ve talked a lot about statistical methods that rely on certain assumptions about your data, like it being normally distributed (forming that classic bell curve) or having a specific scale of measurement.

But what happens when your data doesn’t fit these molds?

Maybe the scores from your last party’s karaoke contest are all over the place, or you’re trying to compare the popularity of various party games but only have rankings, not scores.

This is where non-parametric methods come to the rescue.

Breaking Free from Assumptions

Non-parametric methods are the rebels of the statistical world.

They don’t assume your data follows a normal distribution or that it meets strict requirements regarding measurement scales.

These methods are perfect for dealing with ordinal data (like rankings), nominal data (like categories), or when your data is skewed or has outliers that would throw off other tests.

When to Use Non-parametric Methods?

  • Your data is not normally distributed, and transformations don’t help.
  • You have ordinal data (like survey responses that range from “Strongly Disagree” to “Strongly Agree”).
  • You’re dealing with ranks or categories rather than precise measurements.
  • Your sample size is small, making it hard to meet the assumptions required for parametric tests.

Some Popular Non-parametric Tests

  • Mann-Whitney U Test: Think of it as the non-parametric counterpart to the independent samples t-test. Use this when you want to compare the differences between two independent groups on a ranking or ordinal scale.
  • Kruskal-Wallis Test: This is your go-to when you have three or more groups to compare, and it’s similar to an ANOVA but for ranked/ordinal data or when your data doesn’t meet ANOVA’s assumptions.
  • Spearman’s Rank Correlation: When you want to see if there’s a relationship between two sets of rankings, Spearman’s got your back. It’s like Pearson’s correlation for continuous data but designed for ranks.
  • Wilcoxon Signed-Rank Test: Use this for comparing two related samples when you can’t use the paired t-test, typically because the differences between pairs are not normally distributed.
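
As with the earlier examples, scipy.stats covers these as well. A brief sketch with invented rankings and scores:

```python
from scipy import stats

# Two judges ranking the same eight party games (1 = favourite) -- invented
judge_1 = [1, 2, 3, 4, 5, 6, 7, 8]
judge_2 = [2, 1, 4, 3, 6, 5, 8, 7]
rho, p_rho = stats.spearmanr(judge_1, judge_2)

# Satisfaction scores for the same guests at two parties (differences not normal)
party_1 = [3, 6, 2, 5, 4, 7, 3, 5]
party_2 = [4, 5, 4, 6, 5, 6, 5, 7]
w_stat, p_w = stats.wilcoxon(party_1, party_2)

print(f"Spearman rho = {rho:.2f} (p = {p_rho:.4f})")
print(f"Wilcoxon statistic = {w_stat:.1f} (p = {p_w:.4f})")
```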

The Beauty of Flexibility

The real charm of non-parametric methods is their flexibility.

They let you work with data that’s not textbook perfect, which is often the case in the real world.

Whether you’re analyzing customer satisfaction surveys, comparing the effectiveness of different marketing strategies, or just trying to figure out if people prefer pizza or tacos at parties, non-parametric tests provide a robust way to get meaningful insights.

Keeping It Real

It’s important to remember that while non-parametric methods are incredibly useful, they also come with their own limitations.

They might be more conservative, meaning you might need a larger effect to detect a significant result compared to parametric tests.

Plus, because they often work with ranks rather than actual values, some information about your data might get lost in translation.

Non-parametric methods are your statistical toolbox’s Swiss Army knife, ready to tackle data that doesn’t fit into the neat categories required by more traditional tests.

They remind us that in the world of data analysis, there’s more than one way to uncover insights and make informed decisions.

So, the next time you’re faced with skewed distributions or rankings instead of scores, remember that non-parametric methods have got you covered, offering a way to navigate the complexities of real-world data.

Data Cleaning and Preparation: The Unsung Heroes of Data Analysis

Before any party can start, there’s always a bit of housecleaning to do—sweeping the floors, arranging the furniture, and maybe even hiding those laundry piles you’ve been ignoring all week.

Similarly, in the world of data analysis, before we can dive into the fun stuff like statistical tests and predictive modeling, we need to roll up our sleeves and get our data nice and tidy.

This process of data cleaning and preparation might not be the most glamorous part of data science, but it’s absolutely critical.

Let’s break down what this involves and why it’s so important.

Why Clean and Prepare Data?

Imagine trying to analyze party RSVPs when half the responses are “yes,” a quarter are “Y,” and the rest are a creative mix of “yup,” “sure,” and “why not?”

Without standardization, it’s hard to get a clear picture of how many guests to expect.

The same goes for any data set. Cleaning ensures that your data is consistent, accurate, and ready for analysis.

Preparation involves transforming this clean data into a format that’s useful for your specific analysis needs.

The Steps to Sparkling Clean Data

  • Dealing with Missing Values: Sometimes, data is incomplete. Maybe a survey respondent skipped a question, or a sensor failed to record a reading. You’ll need to decide whether to fill in these gaps (imputation), ignore them, or drop the observations altogether.
  • Identifying and Handling Outliers: Outliers are data points that are significantly different from the rest. They might be errors, or they might be valuable insights. The challenge is determining which is which and deciding how to handle them—remove, adjust, or analyze separately.
  • Correcting Inconsistencies: This is like making sure all your RSVPs are in the same format. It could involve standardizing text entries, correcting typos, or converting all measurements to the same units.
  • Formatting Data: Your analysis might require data in a specific format. This could mean transforming data types (e.g., converting dates into a uniform format) or restructuring data tables to make them easier to work with.
  • Reducing Dimensionality: Sometimes, your data set might have more information than you actually need. Reducing dimensionality (through methods like Principal Component Analysis) can help simplify your data without losing valuable information.
  • Creating New Variables: You might need to derive new variables from your existing ones to better capture the relationships in your data. For example, turning raw survey responses into a numerical satisfaction score.
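
Most of these steps translate into a few lines of pandas. Below is a hypothetical sketch of cleaning up those messy RSVPs; the column names, values, and cleaning rules are all invented for illustration:

```python
import numpy as np
import pandas as pd

# Invented, messy RSVP data
rsvps = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo", "Dee", "Eli"],
    "response": ["yes", "Y", "yup", "sure", None],
    "age": [22, 23, np.nan, 24, 230],   # 230 is an obvious data-entry error
})

# Correct inconsistencies: map every variant of "yes" onto a single standard value
yes_like = {"yes", "y", "yup", "sure", "why not?"}
rsvps["response"] = (
    rsvps["response"]
    .str.strip()
    .str.lower()
    .map(lambda r: "yes" if r in yes_like else r)
)

# Deal with missing values: fill missing ages with the median age
rsvps["age"] = rsvps["age"].fillna(rsvps["age"].median())

# Handle outliers: flag implausible ages rather than silently keeping them
rsvps["age_outlier"] = rsvps["age"] > 120

print(rsvps)
```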

The Tools of the Trade

There are many tools available to help with data cleaning and preparation, ranging from spreadsheet software like Excel to programming languages like Python and R.

These tools offer functions and libraries specifically designed to make data cleaning as painless as possible.

Why It Matters

Skipping the data cleaning and preparation stage is like trying to cook without prepping your ingredients first.

Sure, you might end up with something edible, but it’s not going to be as good as it could have been.

Clean and well-prepared data leads to more accurate, reliable, and meaningful analysis results.

It’s the foundation upon which all good data analysis is built.

Data cleaning and preparation might not be the flashiest part of data science, but it’s where all successful data analysis projects begin.

By taking the time to thoroughly clean and prepare your data, you’re setting yourself up for clearer insights, better decisions, and, ultimately, more impactful outcomes.

Software Tools for Statistical Analysis: Your Digital Assistants

Diving into the world of data without the right tools can feel like trying to cook a gourmet meal without a kitchen.

Just as you need pots, pans, and a stove to create a culinary masterpiece, you need the right software tools to analyze data and uncover the insights hidden within.

These digital assistants range from user-friendly applications for beginners to powerful suites for the pros.

Let’s take a closer look at some of the most popular software tools for statistical analysis.

R and RStudio: The Dynamic Duo

  • R is like the Swiss Army knife of statistical analysis. It’s a programming language designed specifically for data analysis, graphics, and statistical modeling. Think of R as the kitchen where you’ll be cooking up your data analysis.
  • RStudio is an integrated development environment (IDE) for R. It’s like having the best kitchen setup with organized countertops (your coding space) and all your tools and ingredients within reach (packages and datasets).

Why They Rock:

R is incredibly powerful and can handle almost any data analysis task you throw at it, from the basics to the most advanced statistical models.

Plus, there’s a vast community of users, which means a wealth of tutorials, forums, and free packages to add on.

Python with pandas and scipy: The Versatile Virtuoso

  • Python is not just for programming; with the right libraries, it becomes an excellent tool for data analysis. It’s like a kitchen that’s not only great for baking but also equipped for gourmet cooking.
  • pandas is a library that provides easy-to-use data structures and data analysis tools for Python. Imagine it as your sous-chef, helping you to slice and dice data with ease.
  • scipy is another library used for scientific and technical computing. It’s like having a set of precision knives for the more intricate tasks.

Why They Rock: Python is known for its readability and simplicity, making it accessible for beginners. When combined with pandas and scipy, it becomes a powerhouse for data manipulation, analysis, and visualization.

SPSS: The Point-and-Click Professional

SPSS (Statistical Package for the Social Sciences) is a software package used for interactive, or batched, statistical analysis. Long produced by SPSS Inc., it was acquired by IBM in 2009.

Why It Rocks: SPSS is particularly user-friendly with its point-and-click interface, making it a favorite among non-programmers and researchers in the social sciences. It’s like having a kitchen gadget that does the job with the push of a button—no manual setup required.

SAS: The Corporate Chef

SAS (Statistical Analysis System) is a software suite developed for advanced analytics, multivariate analysis, business intelligence, data management, and predictive analytics.

Why It Rocks: SAS is a powerhouse in the corporate world, known for its stability, deep analytical capabilities, and support for large data sets. It’s like the industrial kitchen used by professional chefs to serve hundreds of guests.

Excel: The Accessible Apprentice

Excel might not be a specialized statistical software, but it’s widely accessible and capable of handling basic statistical analyses. Think of Excel as the microwave in your kitchen—it might not be fancy, but it gets the job done for quick and simple tasks.

Why It Rocks: Almost everyone has access to Excel and knows the basics, making it a great starting point for those new to data analysis. Plus, with add-ons like the Analysis ToolPak, Excel’s capabilities can be extended further into statistical territory.

Choosing Your Tool

Selecting the right software tool for statistical analysis is like choosing the right kitchen for your cooking style—it depends on your needs, expertise, and the complexity of your recipes (data).

Whether you’re a coding chef ready to tackle R or Python, or someone who prefers the straightforwardness of SPSS or Excel, there’s a tool out there that’s perfect for your data analysis kitchen.

Ethical Considerations


Embarking on a data analysis journey is like setting sail on the vast ocean of information.

Just as a captain needs a compass to navigate the seas safely and responsibly, a data analyst requires a strong sense of ethics to guide their exploration of data.

Ethical considerations in data analysis are the moral compass that ensures we respect privacy, consent, and integrity while uncovering the truths hidden within data. Let’s delve into why ethics are so crucial and what principles you should keep in mind.

Respect for Privacy

Imagine you’ve found a diary filled with personal secrets.

Reading it without permission would be a breach of privacy.

Similarly, when you’re handling data, especially personal or sensitive information, it’s essential to ensure that privacy is protected.

This means not only securing data against unauthorized access but also anonymizing data to prevent individuals from being identified.

Informed Consent

Before you can set sail, you need the ship owner’s permission.

In the world of data, this translates to informed consent. Participants should be fully aware of what their data will be used for and voluntarily agree to participate.

This is particularly important in research or when collecting data directly from individuals. It’s like asking for permission before you start the journey.

Data Integrity

Maintaining data integrity is like keeping the ship’s log accurate and unaltered during your voyage.

It involves ensuring the data is not corrupted or modified inappropriately and that any data analysis is conducted accurately and reliably.

Tampering with data or cherry-picking results to fit a narrative is not just unethical—it’s like falsifying the ship’s log, leading to mistrust and potentially dangerous outcomes.

Avoiding Bias

The sea is vast, and your compass must be calibrated correctly to avoid going off course. Similarly, avoiding bias in data analysis ensures your findings are valid and unbiased.

This means being aware of and actively addressing any personal, cultural, or statistical biases that might skew your analysis.

It’s about striving for objectivity and ensuring your journey is guided by truth, not preconceived notions.

Transparency and Accountability

A trustworthy captain is open about their navigational choices and ready to take responsibility for them.

In data analysis, this translates to transparency about your methods and accountability for your conclusions.

Sharing your methodologies, data sources, and any limitations of your analysis helps build trust and allows others to verify or challenge your findings.

Ethical Use of Findings

Finally, just as a captain must consider the impact of their journey on the wider world, you must consider how your data analysis will be used.

This means thinking about the potential consequences of your findings and striving to ensure they are used to benefit, not harm, society.

It’s about being mindful of the broader implications of your work and using data for good.

Navigating with a Moral Compass

In the realm of data analysis, ethical considerations form the moral compass that guides us through complex moral waters.

They ensure that our work respects individuals’ rights, contributes positively to society, and upholds the highest standards of integrity and professionalism.

Just as a captain navigates the seas with respect for the ocean and its dangers, a data analyst must navigate the world of data with a deep commitment to ethical principles.

This commitment ensures that the insights gained from data analysis serve to enlighten and improve, rather than exploit or harm.

Conclusion and Key Takeaways

And there you have it—a whirlwind tour through the fascinating landscape of statistical methods for data analysis.

From the grounding principles of descriptive and inferential statistics to the nuanced details of regression analysis and beyond, we’ve explored the tools and ethical considerations that guide us in turning raw data into meaningful insights.

The Takeaway

Think of data analysis as embarking on a grand adventure, one where numbers and facts are your map and compass.

Just as every explorer needs to understand the terrain, every aspiring data analyst must grasp these foundational concepts.

Whether it’s summarizing data sets with descriptive statistics, making predictions with inferential statistics, choosing the right statistical test, or navigating the ethical considerations that ensure our analyses benefit society, each aspect is a crucial step on your journey.

The Importance of Preparation

Remember, the key to a successful voyage is preparation.

Cleaning and preparing your data sets the stage for a smooth journey, while choosing the right software tools ensures you have the best equipment at your disposal.

And just as every responsible navigator respects the sea, every data analyst must navigate the ethical dimensions of their work with care and integrity.

Charting Your Course

As you embark on your own data analysis adventures, remember that the path you chart is unique to you.

Your questions will guide your journey, your curiosity will fuel your exploration, and the insights you gain will be your treasure.

The world of data is vast and full of mysteries waiting to be uncovered. With the tools and principles we’ve discussed, you’re well-equipped to start uncovering those mysteries, one data set at a time.

The Journey Ahead

The journey of statistical methods for data analysis is ongoing, and the landscape is ever-evolving.

As new methods emerge and our understanding deepens, there will always be new horizons to explore and new insights to discover.

But the fundamentals we’ve covered will remain your steadfast guide, helping you navigate the challenges and opportunities that lie ahead.

So set your sights on the questions that spark your curiosity, arm yourself with the tools of the trade, and embark on your data analysis journey with confidence.

About The Author


Silvia Valcheva

Silvia Valcheva is a digital marketer with over a decade of experience creating content for the tech industry. She has a strong passion for writing about emerging software and technologies such as big data, AI (Artificial Intelligence), IoT (Internet of Things), process automation, etc.

Indian J Anaesth. 2016 Sep; 60(9)

Basic statistical tools in research and data analysis

Zulfiqar Ali

Department of Anaesthesiology, Division of Neuroanaesthesiology, Sheri Kashmir Institute of Medical Sciences, Soura, Srinagar, Jammu and Kashmir, India

S Bala Bhaskar

1 Department of Anaesthesiology and Critical Care, Vijayanagar Institute of Medical Sciences, Bellary, Karnataka, India

Statistical methods involved in carrying out a study include planning, designing, collecting data, analysing, drawing meaningful interpretation and reporting of the research findings. Statistical analysis gives meaning to meaningless numbers, thereby breathing life into lifeless data. The results and inferences are precise only if proper statistical tests are used. This article will try to acquaint the reader with the basic research tools that are utilised while conducting various studies. The article covers a brief outline of the variables, an understanding of quantitative and qualitative variables and the measures of central tendency. An idea of the sample size estimation, power analysis and the statistical errors is given. Finally, there is a summary of parametric and non-parametric tests used for data analysis.

INTRODUCTION

Statistics is a branch of science that deals with the collection, organisation, analysis of data and drawing of inferences from the samples to the whole population.[ 1 ] This requires a proper design of the study, an appropriate selection of the study sample and choice of a suitable statistical test. An adequate knowledge of statistics is necessary for proper designing of an epidemiological study or a clinical trial. Improper statistical methods may result in erroneous conclusions which may lead to unethical practice.[ 2 ]

A variable is a characteristic that varies from one individual member of a population to another.[ 3 ] Variables such as height and weight are measured by some type of scale, convey quantitative information and are called quantitative variables. Sex and eye colour give qualitative information and are called qualitative variables[ 3 ] [ Figure 1 ].

[Figure 1: Classification of variables]

Quantitative variables

Quantitative or numerical data are subdivided into discrete and continuous measurements. Discrete numerical data are recorded as a whole number such as 0, 1, 2, 3,… (integer), whereas continuous data can assume any value. Observations that can be counted constitute the discrete data and observations that can be measured constitute the continuous data. Examples of discrete data are number of episodes of respiratory arrests or the number of re-intubations in an intensive care unit. Similarly, examples of continuous data are the serial serum glucose levels, partial pressure of oxygen in arterial blood and the oesophageal temperature.

A hierarchical scale of increasing precision can be used for observing and recording the data which is based on categorical, ordinal, interval and ratio scales [ Figure 1 ].

Categorical or nominal variables are unordered. The data are merely classified into categories and cannot be arranged in any particular order. If only two categories exist (as in gender: male and female), it is called dichotomous (or binary) data. The various causes of re-intubation in an intensive care unit due to upper airway obstruction, impaired clearance of secretions, hypoxemia, hypercapnia, pulmonary oedema and neurological impairment are examples of categorical variables.

Ordinal variables have a clear ordering between the variables. However, the ordered data may not have equal intervals. Examples are the American Society of Anesthesiologists status or Richmond agitation-sedation scale.

Interval variables are similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced. A good example of an interval scale is the Fahrenheit degree scale used to measure temperature. With the Fahrenheit scale, the difference between 70° and 75° is equal to the difference between 80° and 85°: The units of measurement are equal throughout the full range of the scale.

Ratio scales are similar to interval scales, in that equal differences between scale values have equal quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional property. For example, the system of centimetres is an example of a ratio scale. There is a true zero point and the value of 0 cm means a complete absence of length. The thyromental distance of 6 cm in an adult may be twice that of a child in whom it may be 3 cm.

STATISTICS: DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics[ 4 ] try to describe the relationship between variables in a sample or population. Descriptive statistics provide a summary of data in the form of mean, median and mode. Inferential statistics[ 4 ] use a random sample of data taken from a population to describe and make inferences about the whole population. It is valuable when it is not possible to examine each member of an entire population. Examples of descriptive and inferential statistics are illustrated in Table 1.

[Table 1: Example of descriptive and inferential statistics]

Descriptive statistics

The extent to which the observations cluster around a central location is described by the central tendency and the spread towards the extremes is described by the degree of dispersion.

Measures of central tendency

The measures of central tendency are mean, median and mode.[ 6 ] Mean (or the arithmetic average) is the sum of all the scores divided by the number of scores. Mean may be influenced profoundly by extreme values. For example, the average stay of organophosphorus poisoning patients in ICU may be influenced by a single patient who stays in ICU for around 5 months because of septicaemia. The extreme values are called outliers. The formula for the mean is

$\text{Mean} = \frac{\sum x}{n}$

where x = each observation and n = number of observations. Median[ 6 ] is defined as the middle of a distribution in ranked data (with half of the variables in the sample above and half below the median value), while mode is the most frequently occurring variable in a distribution. Range defines the spread, or variability, of a sample.[ 7 ] It is described by the minimum and maximum values of the variables. If we rank the data and, after ranking, group the observations into percentiles, we can get better information about the pattern of spread of the variables. In percentiles, we rank the observations into 100 equal parts. We can then describe 25%, 50%, 75% or any other percentile amount. The median is the 50th percentile. The interquartile range will be the observations in the middle 50% of the observations about the median (25th–75th percentile). Variance[ 7 ] is a measure of how spread out the distribution is. It gives an indication of how closely an individual observation clusters about the mean value. The variance of a population is defined by the following formula:

$\sigma^2 = \frac{\sum (X_i - X)^2}{N}$

where σ² is the population variance, X is the population mean, Xi is the ith element from the population and N is the number of elements in the population. The variance of a sample is defined by a slightly different formula:

$s^2 = \frac{\sum (x_i - x)^2}{n - 1}$

where s² is the sample variance, x is the sample mean, xi is the ith element from the sample and n is the number of elements in the sample. The formula for the variance of a population has the value ‘N’ as the denominator. The expression ‘n − 1’ is known as the degrees of freedom and is one less than the number of observations. Each observation is free to vary, except the last one, which must be a defined value. The variance is measured in squared units. To make the interpretation of the data simple and to retain the basic unit of observation, the square root of variance is used. The square root of the variance is the standard deviation (SD).[ 8 ] The SD of a population is defined by the following formula:

$\sigma = \sqrt{\frac{\sum (X_i - X)^2}{N}}$

where σ is the population SD, X is the population mean, Xi is the ith element from the population and N is the number of elements in the population. The SD of a sample is defined by a slightly different formula:

$s = \sqrt{\frac{\sum (x_i - x)^2}{n - 1}}$

where s is the sample SD, x is the sample mean, xi is the ith element from the sample and n is the number of elements in the sample. An example of the calculation of variance and SD is illustrated in Table 2.

[Table 2: Example of mean, variance and standard deviation]

Normal distribution or Gaussian distribution

Most of the biological variables usually cluster around a central value, with symmetrical positive and negative deviations about this point.[ 1 ] The standard normal distribution curve is a symmetrical, bell-shaped curve. In a normal distribution curve, about 68% of the scores are within 1 SD of the mean, around 95% of the scores are within 2 SDs of the mean and about 99.7% are within 3 SDs of the mean [ Figure 2 ].

[Figure 2: Normal distribution curve]

Skewed distribution

It is a distribution with an asymmetry of the variables about its mean. In a negatively skewed distribution [ Figure 3 ], the mass of the distribution is concentrated on the right of the figure, leading to a longer left tail. In a positively skewed distribution [ Figure 3 ], the mass of the distribution is concentrated on the left of the figure, leading to a longer right tail.

[Figure 3: Curves showing negatively skewed and positively skewed distributions]

Inferential statistics

In inferential statistics, data are analysed from a sample to make inferences in the larger collection of the population. The purpose is to answer or test the hypotheses. A hypothesis (plural hypotheses) is a proposed explanation for a phenomenon. Hypothesis tests are thus procedures for making rational decisions about the reality of observed effects.

Probability is the measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty).

In inferential statistics, the term ‘null hypothesis’ (H0, ‘H-naught’, ‘H-null’) denotes that there is no relationship (difference) between the population variables in question.[ 9 ]

The alternative hypothesis (H1 or Ha) denotes that a relationship (difference) between the variables is expected to be true.[ 9 ]

The P value (or the calculated probability) is the probability of the observed event occurring by chance if the null hypothesis is true. The P value is a number between 0 and 1 and is interpreted by researchers in deciding whether to reject or retain the null hypothesis [ Table 3 ].

[Table 3: P values with interpretation]

If the P value is less than the arbitrarily chosen value (known as α or the significance level), the null hypothesis (H0) is rejected [ Table 4 ]. However, if the null hypothesis (H0) is incorrectly rejected, this is known as a Type I error.[ 11 ] Further details regarding alpha error, beta error and sample size calculation and the factors influencing them are dealt with in another section of this issue by Das S et al.[ 12 ]

[Table 4: Illustration for null hypothesis]

PARAMETRIC AND NON-PARAMETRIC TESTS

Numerical data (quantitative variables) that are normally distributed are analysed with parametric tests.[ 13 ]

Two most basic prerequisites for parametric statistical analysis are:

  • The assumption of normality which specifies that the means of the sample group are normally distributed
  • The assumption of equal variance which specifies that the variances of the samples and of their corresponding population are equal.

However, if the distribution of the sample is skewed towards one side or the distribution is unknown due to the small sample size, non-parametric[ 14 ] statistical techniques are used. Non-parametric tests are used to analyse ordinal and categorical data.

Parametric tests

The parametric tests assume that the data are on a quantitative (numerical) scale, with a normal distribution of the underlying population. The samples have the same variance (homogeneity of variances). The samples are randomly drawn from the population, and the observations within a group are independent of each other. The commonly used parametric tests are the Student's t-test, analysis of variance (ANOVA) and repeated measures ANOVA.

Student's t-test

Student's t-test is used to test the null hypothesis that there is no difference between the means of the two groups. It is used in three circumstances:

  • To test if a sample mean (as an estimate of the population mean) differs significantly from a given population mean (the one-sample t-test). The formula is:

$t = \frac{X - u}{SE}$

where X = sample mean, u = population mean and SE = standard error of the mean.

  • To test if the population means estimated by two independent samples differ significantly (the unpaired t-test). The formula is:

$t = \frac{X_1 - X_2}{SE}$

where $X_1 - X_2$ is the difference between the means of the two groups and SE denotes the standard error of the difference.

  • To test if the population means estimated by two dependent samples differ significantly (the paired t-test). A usual setting for the paired t-test is when measurements are made on the same subjects before and after a treatment.

The formula for the paired t-test is:

$t = \frac{d}{SE}$

where d is the mean difference and SE denotes the standard error of this difference.

The group variances can be compared using the F-test. The F-test is the ratio of the variances (var 1/var 2). If F differs significantly from 1.0, then it is concluded that the group variances differ significantly.

Analysis of variance

The Student's t-test cannot be used for comparison of three or more groups. The purpose of ANOVA is to test if there is any significant difference between the means of two or more groups.

In ANOVA, we study two variances – (a) between-group variability and (b) within-group variability. The within-group variability (error variance) is the variation that cannot be accounted for in the study design. It is based on random differences present in our samples.

However, the between-group variability (or effect variance) is the result of our treatment. These two estimates of variance are compared using the F-test.

A simplified formula for the F statistic is:

$F = \frac{MS_b}{MS_w}$

where $MS_b$ is the mean squares between the groups and $MS_w$ is the mean squares within the groups.

Repeated measures analysis of variance

As with ANOVA, repeated measures ANOVA analyses the equality of means of three or more groups. However, a repeated measure ANOVA is used when all variables of a sample are measured under different conditions or at different points in time.

As the variables are measured from a sample at different points of time, the measurement of the dependent variable is repeated. Using a standard ANOVA in this case is not appropriate because it fails to model the correlation between the repeated measures: The data violate the ANOVA assumption of independence. Hence, in the measurement of repeated dependent variables, repeated measures ANOVA should be used.

Non-parametric tests

When the assumptions of normality are not met and the sample means are not normally distributed, parametric tests can lead to erroneous results. Non-parametric tests (distribution-free tests) are used in such situations as they do not require the normality assumption.[ 15 ] Non-parametric tests may fail to detect a significant difference when compared with a parametric test. That is, they usually have less power.

As is done for the parametric tests, the test statistic is compared with known values for the sampling distribution of that statistic and the null hypothesis is accepted or rejected. The types of non-parametric analysis techniques and the corresponding parametric analysis techniques are delineated in Table 5 .

[Table 5: Analogues of parametric and non-parametric tests]

Median test for one sample: The sign test and Wilcoxon's signed rank test

The sign test and Wilcoxon's signed rank test are used for median tests of one sample. These tests examine whether one instance of sample data is greater or smaller than the median reference value.

The sign test examines a hypothesis about the median θ0 of a population. It tests the null hypothesis H0: θ = θ0. When the observed value (Xi) is greater than the reference value (θ0), it is marked with a + sign. If the observed value is smaller than the reference value, it is marked with a − sign. If the observed value is equal to the reference value (θ0), it is eliminated from the sample.

If the null hypothesis is true, there will be an equal number of + signs and − signs.

The sign test ignores the actual values of the data and only uses + or − signs. Therefore, it is useful when it is difficult to measure the values.

Wilcoxon's signed rank test

There is a major limitation of sign test as we lose the quantitative information of the given data and merely use the + or – signs. Wilcoxon's signed rank test not only examines the observed values in comparison with θ0 but also takes into consideration the relative sizes, adding more statistical power to the test. As in the sign test, if there is an observed value that is equal to the reference value θ0, this observed value is eliminated from the sample.

Wilcoxon's rank sum test ranks all data points in order, calculates the rank sum of each sample and compares the difference in the rank sums.

Mann-Whitney test

It is used to test the null hypothesis that two samples have the same median or, alternatively, whether observations in one sample tend to be larger than observations in the other.

The Mann–Whitney test compares all data (xi) belonging to the X group and all data (yi) belonging to the Y group and calculates the probability of xi being greater than yi: P (xi > yi). The null hypothesis states that P (xi > yi) = P (xi < yi) = 1/2, while the alternative hypothesis states that P (xi > yi) ≠ 1/2.

Kolmogorov-Smirnov test

The two-sample Kolmogorov-Smirnov (KS) test was designed as a generic method to test whether two random samples are drawn from the same distribution. The null hypothesis of the KS test is that both distributions are identical. The statistic of the KS test is a distance between the two empirical distributions, computed as the maximum absolute difference between their cumulative curves.

Kruskal-Wallis test

The Kruskal–Wallis test is a non-parametric test to analyse the variance.[ 14 ] It analyses if there is any difference in the median values of three or more independent samples. The data values are ranked in an increasing order, and the rank sums calculated followed by calculation of the test statistic.

Jonckheere test

In contrast to the Kruskal–Wallis test, the Jonckheere test assumes an a priori ordering of the groups, which gives it more statistical power than the Kruskal–Wallis test.[ 14 ]

Friedman test

The Friedman test is a non-parametric test for testing the difference between several related samples. It is an alternative to repeated measures ANOVA, used when the same parameter has been measured under different conditions on the same subjects.[ 13 ]

Tests to analyse the categorical data

Chi-square test, Fisher's exact test and McNemar's test are used to analyse categorical or nominal variables. The Chi-square test compares the frequencies and tests whether the observed data differ significantly from the expected data if there were no differences between groups (i.e., the null hypothesis). It is calculated as the sum of the squared difference between the observed (O) and the expected (E) data (or the deviation, d) divided by the expected data, by the following formula:

$\chi^2 = \sum \frac{(O - E)^2}{E}$

A Yates correction factor is used when the sample size is small. Fisher's exact test is used to determine if there are non-random associations between two categorical variables. It does not assume random sampling, and instead of referring a calculated statistic to a sampling distribution, it calculates an exact probability. McNemar's test is used for paired nominal data. It is applied to a 2 × 2 table with paired-dependent samples. It is used to determine whether the row and column frequencies are equal (that is, whether there is ‘marginal homogeneity’). The null hypothesis is that the paired proportions are equal. The Mantel-Haenszel Chi-square test is a multivariate test as it analyses multiple grouping variables. It stratifies according to the nominated confounding variables and identifies any that affect the primary outcome variable. If the outcome variable is dichotomous, then logistic regression is used.

SOFTWARES AVAILABLE FOR STATISTICS, SAMPLE SIZE CALCULATION AND POWER ANALYSIS

Numerous statistical software systems are available currently. The commonly used software systems are Statistical Package for the Social Sciences (SPSS – manufactured by IBM Corporation), Statistical Analysis System (SAS – developed by SAS Institute, North Carolina, United States of America), R (designed by Ross Ihaka and Robert Gentleman from the R core team), Minitab (developed by Minitab Inc), Stata (developed by StataCorp) and MS Excel (developed by Microsoft).

There are a number of web resources which are related to statistical power analyses. A few are:

  • StatPages.net – provides links to a number of online power calculators
  • G-Power – provides a downloadable power analysis program that runs under DOS
  • Power analysis for ANOVA designs – an interactive site that calculates the power or sample size needed to attain a given power for one effect in a factorial ANOVA design
  • SPSS makes a program called SamplePower, which produces a complete report on screen that can be cut and pasted into another document.
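To show the kind of calculation these tools perform, here is a small sketch using statsmodels (not mentioned in the list above); the effect size, alpha and power values are arbitrary choices for the example:

```python
from statsmodels.stats.power import TTestIndPower

# Assumed inputs: medium effect size (0.5), 5% significance, 80% power
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                                   alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")  # roughly 64
```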

It is important that a researcher knows the concepts of the basic statistical methods used in the conduct of a research study. This will help in conducting an appropriately well-designed study, leading to valid and reliable results. Inappropriate use of statistical techniques may lead to faulty conclusions, inducing errors and undermining the significance of the article. Bad statistics may lead to bad research, and bad research may lead to unethical practice. Hence, adequate knowledge of statistics and the appropriate use of statistical tests are important. A sound grasp of the basic statistical methods will go a long way in improving research designs and producing quality medical research which can be utilised for formulating evidence-based guidelines.



Introduction to Statistical Analysis: A Beginner’s Guide.

Statistical analysis is a crucial component of research work across various disciplines, helping researchers derive meaningful insights from data. Whether you’re conducting scientific studies, social research, or data-driven investigations, having a solid understanding of statistical analysis is essential. In this beginner’s guide, we will explore the fundamental concepts and techniques of statistical analysis specifically tailored for research work, providing you with a strong foundation to enhance the quality and credibility of your research findings.

1. Importance of Statistical Analysis in Research:

Research aims to uncover knowledge and make informed conclusions. Statistical analysis plays a pivotal role in achieving this by providing tools and methods to analyze and interpret data accurately. It helps researchers identify patterns, test hypotheses, draw inferences, and quantify the strength of relationships between variables. Understanding the significance of statistical analysis empowers researchers to make evidence-based decisions.

2. Data Collection and Organization:

Before diving into statistical analysis, researchers must collect and organize their data effectively. We will discuss the importance of proper sampling techniques, data quality assurance, and data preprocessing. Additionally, we will explore methods to handle missing data and outliers, ensuring that your dataset is reliable and suitable for analysis.

3. Exploratory Data Analysis (EDA):

Exploratory Data Analysis is a preliminary step that involves visually exploring and summarizing the main characteristics of the data. We will cover techniques such as data visualization, descriptive statistics, and data transformations to gain insights into the distribution, central tendencies, and variability of the variables in your dataset. EDA helps researchers understand the underlying structure of the data and identify potential relationships for further investigation.
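To make this concrete, here is a minimal EDA sketch in Python using pandas and matplotlib; the dataset and column names are invented for illustration only:

```python
import pandas as pd
import matplotlib.pyplot as plt

# A small, made-up dataset standing in for real survey data
df = pd.DataFrame({
    "age":      [23, 35, 31, 46, 52, 28, 39, 44, 61, 33],
    "income":   [28, 52, 47, 61, 75, 39, 58, 66, 80, 50],   # thousands
    "spending": [20, 35, 33, 40, 48, 27, 37, 45, 52, 36],   # thousands
})

# Summary statistics: count, mean, std and quartiles for each column
print(df.describe())

# Distribution of a single variable
df["age"].hist(bins=5)
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

# Relationship between two variables
df.plot.scatter(x="income", y="spending")
plt.show()
```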

4. Statistical Inference and Hypothesis Testing:

Statistical inference allows researchers to make generalizations about a population based on a sample. We will delve into hypothesis testing, covering concepts such as null and alternative hypotheses, p-values, and significance levels. By understanding these concepts, you will be able to test your research hypotheses and determine if the observed results are statistically significant.

5. Parametric and Non-parametric Tests:

Parametric and non-parametric tests are statistical techniques used to analyze data based on different assumptions about the underlying population distribution. We will explore commonly used parametric tests, such as t-tests and analysis of variance (ANOVA), as well as non-parametric tests like the Mann-Whitney U test and Kruskal-Wallis test. Understanding when to use each type of test is crucial for selecting the appropriate analysis method for your research questions.

6. Correlation and Regression Analysis:

Correlation and regression analysis allow researchers to explore relationships between variables and make predictions. We will cover Pearson correlation coefficients, multiple regression analysis, and logistic regression. These techniques enable researchers to quantify the strength and direction of associations and identify predictive factors in their research.
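As a hedged sketch (the variable names and data are hypothetical), these techniques could be run in Python with SciPy and statsmodels:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical continuous variables
hours_studied = rng.uniform(0, 10, 100)
exam_score = 50 + 4 * hours_studied + rng.normal(0, 10, 100)

# Pearson correlation coefficient and its p-value
r, p = stats.pearsonr(hours_studied, exam_score)
print(f"Pearson r = {r:.2f}, p = {p:.4g}")

# Linear regression (an intercept term is added explicitly)
X = sm.add_constant(hours_studied)
ols_model = sm.OLS(exam_score, X).fit()
print(ols_model.params)   # intercept and slope estimates

# Logistic regression for a binary outcome (pass / fail)
passed = (exam_score > 70).astype(int)
logit_model = sm.Logit(passed, X).fit(disp=False)
print(logit_model.params)
```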

7. Sample Size Determination and Power Analysis:

Sample size determination is a critical aspect of research design, as it affects the validity and reliability of your findings. We will discuss methods for estimating sample size based on statistical power analysis, ensuring that your study has sufficient statistical power to detect meaningful effects. Understanding sample size determination is essential for planning robust research studies.

Conclusion:

Statistical analysis is an indispensable tool for conducting high-quality research. This beginner’s guide has provided an overview of key concepts and techniques specifically tailored for research work, enabling you to enhance the credibility and reliability of your findings. By understanding the importance of statistical analysis, collecting and organizing data effectively, performing exploratory data analysis, conducting hypothesis testing, utilizing parametric and non-parametric tests, and considering sample size determination, you will be well-equipped to carry out rigorous research and contribute valuable insights to your field. Remember, continuous learning, practice, and seeking guidance from statistical experts will further enhance your skills in statistical analysis for research.


Quantitative Data Analysis 101

The lingo, methods and techniques, explained simply.

By: Derek Jansen (MBA)  and Kerryn Warren (PhD) | December 2020

Quantitative data analysis is one of those things that often strikes fear in students. It’s totally understandable – quantitative analysis is a complex topic, full of daunting lingo , like medians, modes, correlation and regression. Suddenly we’re all wishing we’d paid a little more attention in math class…

The good news is that while quantitative data analysis is a mammoth topic, gaining a working understanding of the basics isn’t that hard , even for those of us who avoid numbers and math . In this post, we’ll break quantitative analysis down into simple , bite-sized chunks so you can approach your research with confidence.


Overview: Quantitative Data Analysis 101

  • What (exactly) is quantitative data analysis?
  • When to use quantitative analysis
  • How quantitative analysis works

The two “branches” of quantitative analysis

  • Descriptive statistics 101
  • Inferential statistics 101
  • How to choose the right quantitative methods
  • Recap & summary

What is quantitative data analysis?

Despite being a mouthful, quantitative data analysis simply means analysing data that is numbers-based – or data that can be easily “converted” into numbers without losing any meaning.

For example, category-based variables like gender, ethnicity, or native language could all be “converted” into numbers without losing meaning – for example, English could equal 1, French 2, etc.

This contrasts against qualitative data analysis, where the focus is on words, phrases and expressions that can’t be reduced to numbers. If you’re interested in learning about qualitative analysis, check out our post and video here .

What is quantitative analysis used for?

Quantitative analysis is generally used for three purposes.

  • Firstly, it’s used to measure differences between groups . For example, the popularity of different clothing colours or brands.
  • Secondly, it’s used to assess relationships between variables . For example, the relationship between weather temperature and voter turnout.
  • And third, it’s used to test hypotheses in a scientifically rigorous way. For example, a hypothesis about the impact of a certain vaccine.

Again, this contrasts with qualitative analysis , which can be used to analyse people’s perceptions and feelings about an event or situation. In other words, things that can’t be reduced to numbers.

How does quantitative analysis work?

Well, since quantitative data analysis is all about analysing numbers , it’s no surprise that it involves statistics . Statistical analysis methods form the engine that powers quantitative analysis, and these methods can vary from pretty basic calculations (for example, averages and medians) to more sophisticated analyses (for example, correlations and regressions).

Sounds like gibberish? Don’t worry. We’ll explain all of that in this post. Importantly, you don’t need to be a statistician or math wiz to pull off a good quantitative analysis. We’ll break down all the technical mumbo jumbo in this post.


As I mentioned, quantitative analysis is powered by statistical analysis methods . There are two main “branches” of statistical methods that are used – descriptive statistics and inferential statistics . In your research, you might only use descriptive statistics, or you might use a mix of both , depending on what you’re trying to figure out. In other words, depending on your research questions, aims and objectives . I’ll explain how to choose your methods later.

So, what are descriptive and inferential statistics?

Well, before I can explain that, we need to take a quick detour to explain some lingo. To understand the difference between these two branches of statistics, you need to understand two important words. These words are population and sample .

First up, population . In statistics, the population is the entire group of people (or animals or organisations or whatever) that you’re interested in researching. For example, if you were interested in researching Tesla owners in the US, then the population would be all Tesla owners in the US.

However, it’s extremely unlikely that you’re going to be able to interview or survey every single Tesla owner in the US. Realistically, you’ll likely only get access to a few hundred, or maybe a few thousand owners using an online survey. This smaller group of accessible people whose data you actually collect is called your sample .

So, to recap – the population is the entire group of people you’re interested in, and the sample is the subset of the population that you can actually get access to. In other words, the population is the full chocolate cake , whereas the sample is a slice of that cake.

So, why is this sample-population thing important?

Well, descriptive statistics focus on describing the sample , while inferential statistics aim to make predictions about the population, based on the findings within the sample. In other words, we use one group of statistical methods – descriptive statistics – to investigate the slice of cake, and another group of methods – inferential statistics – to draw conclusions about the entire cake. There I go with the cake analogy again…

With that out the way, let’s take a closer look at each of these branches in more detail.

Descriptive statistics vs inferential statistics

Branch 1: Descriptive Statistics

Descriptive statistics serve a simple but critically important role in your research – to describe your data set – hence the name. In other words, they help you understand the details of your sample . Unlike inferential statistics (which we’ll get to soon), descriptive statistics don’t aim to make inferences or predictions about the entire population – they’re purely interested in the details of your specific sample .

When you’re writing up your analysis, descriptive statistics are the first set of stats you’ll cover, before moving on to inferential statistics. But, that said, depending on your research objectives and research questions , they may be the only type of statistics you use. We’ll explore that a little later.

So, what kind of statistics are usually covered in this section?

Some common statistics used in this branch include the following:

  • Mean – this is simply the mathematical average of a range of numbers.
  • Median – this is the midpoint in a range of numbers when the numbers are arranged in numerical order. If the data set makes up an odd number, then the median is the number right in the middle of the set. If the data set makes up an even number, then the median is the midpoint between the two middle numbers.
  • Mode – this is simply the most commonly occurring number in the data set.
  • Standard deviation – this measures how dispersed the numbers are around the mean. In cases where most of the numbers are quite close to the average, the standard deviation will be relatively low. Conversely, in cases where the numbers are scattered all over the place, the standard deviation will be relatively high.
  • Skewness . As the name suggests, skewness indicates how symmetrical a range of numbers is. In other words, do they tend to cluster into a smooth bell curve shape in the middle of the graph, or do they skew to the left or right?

Feeling a bit confused? Let’s look at a practical example using a small data set.

Descriptive statistics example data

On the left-hand side is the data set. This details the bodyweight of a sample of 10 people. On the right-hand side, we have the descriptive statistics. Let’s take a look at each of them.

First, we can see that the mean weight is 72.4 kilograms. In other words, the average weight across the sample is 72.4 kilograms. Straightforward.

Next, we can see that the median is very similar to the mean (the average). This suggests that this data set has a reasonably symmetrical distribution (in other words, a relatively smooth, centred distribution of weights, clustered towards the centre).

In terms of the mode , there is no mode in this data set. This is because each number is present only once and so there cannot be a “most common number”. If there were two people who were both 65 kilograms, for example, then the mode would be 65.

Next up is the standard deviation. A value of 10.6 indicates that there’s quite a wide spread of numbers. We can see this quite easily by looking at the numbers themselves, which range from 55 to 90 – quite a stretch from the mean of 72.4.

And lastly, the skewness of -0.2 tells us that the data is very slightly negatively skewed. This makes sense since the mean and the median are slightly different.

As you can see, these descriptive statistics give us some useful insight into the data set. Of course, this is a very small data set (only 10 records), so we can’t read into these statistics too much. Also, keep in mind that this is not a list of all possible descriptive statistics – just the most common ones.
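As a rough illustration (the original data set isn’t reproduced here, so the ten weights below are made up to loosely mirror it), the same statistics can be computed with pandas:

```python
import pandas as pd

# Ten invented bodyweights in kg, loosely mirroring the example above
weights = pd.Series([55, 62, 65, 68, 71, 74, 76, 80, 83, 90])

print("Mean:", weights.mean())               # 72.4 for this made-up data
print("Median:", weights.median())           # 72.5 - very close to the mean
print("Mode:", weights.mode().tolist())      # every value ties, since nothing repeats
print("Std dev:", round(weights.std(), 1))   # about 10.4 (sample formula)
print("Skewness:", round(weights.skew(), 2))
```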

But why do all of these numbers matter?

While these descriptive statistics are all fairly basic, they’re important for a few reasons:

  • Firstly, they help you get both a macro and micro-level view of your data. In other words, they help you understand both the big picture and the finer details.
  • Secondly, they help you spot potential errors in the data – for example, if an average is way higher than you’d expect, or responses to a question are highly varied, this can act as a warning sign that you need to double-check the data.
  • And lastly, these descriptive statistics help inform which inferential statistical techniques you can use, as those techniques depend on the skewness (in other words, the symmetry and normality) of the data.

Simply put, descriptive statistics are really important , even though the statistical techniques used are fairly basic. All too often at Grad Coach, we see students skimming over the descriptives in their eagerness to get to the more exciting inferential methods, and then landing up with some very flawed results.

Don’t be a sucker – give your descriptive statistics the love and attention they deserve!

Examples of descriptive statistics

Branch 2: Inferential Statistics

As I mentioned, while descriptive statistics are all about the details of your specific data set – your sample – inferential statistics aim to make inferences about the population . In other words, you’ll use inferential statistics to make predictions about what you’d expect to find in the full population.

What kind of predictions, you ask? Well, there are two common types of predictions that researchers try to make using inferential stats:

  • Firstly, predictions about differences between groups – for example, height differences between children grouped by their favourite meal or gender.
  • And secondly, relationships between variables – for example, the relationship between body weight and the number of hours a week a person does yoga.

In other words, inferential statistics (when done correctly), allow you to connect the dots and make predictions about what you expect to see in the real world population, based on what you observe in your sample data. For this reason, inferential statistics are used for hypothesis testing – in other words, to test hypotheses that predict changes or differences.

Inferential statistics are used to make predictions about what you’d expect to find in the full population, based on the sample.

Of course, when you’re working with inferential statistics, the composition of your sample is really important. In other words, if your sample doesn’t accurately represent the population you’re researching, then your findings won’t necessarily be very useful.

For example, if your population of interest is a mix of 50% male and 50% female , but your sample is 80% male , you can’t make inferences about the population based on your sample, since it’s not representative. This area of statistics is called sampling, but we won’t go down that rabbit hole here (it’s a deep one!) – we’ll save that for another post .

What statistics are usually used in this branch?

There are many, many different statistical analysis methods within the inferential branch and it’d be impossible for us to discuss them all here. So we’ll just take a look at some of the most common inferential statistical methods so that you have a solid starting point.

First up are T-tests. T-tests compare the means (the averages) of two groups of data to assess whether they’re statistically significantly different. In other words, they ask whether the difference between the two group means is large enough, relative to the variation within the groups, that it’s unlikely to be due to chance.

This type of testing is very useful for understanding just how similar or different two groups of data are. For example, you might want to compare the mean blood pressure between two groups of people – one that has taken a new medication and one that hasn’t – to assess whether they are significantly different.

Kicking things up a level, we have ANOVA, which stands for “analysis of variance”. This test is similar to a T-test in that it compares the means of various groups, but ANOVA allows you to analyse multiple groups, not just two. So it’s basically a t-test on steroids…

Next, we have correlation analysis. This type of analysis assesses the relationship between two variables. In other words, if one variable increases, does the other variable also increase, decrease or stay the same? For example, if the average temperature goes up, do average ice cream sales increase too? We’d expect some sort of relationship between these two variables intuitively, but correlation analysis allows us to measure that relationship scientifically.

Lastly, we have regression analysis – this is quite similar to correlation in that it assesses the relationship between variables, but it goes a step further by modelling how one (or more) variables predict another, rather than just checking whether they move together. In other words, it helps you probe whether one variable actually drives the other, or whether they just happen to move together thanks to another force – keeping in mind that regression alone can’t prove causation. Just because two variables correlate doesn’t necessarily mean that one causes the other.
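To ground these four methods, here is a compact, illustrative sketch in Python with SciPy and statsmodels; all the data are randomly generated for the example:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(120, 10, 40)   # e.g. blood pressure, medication group
group_b = rng.normal(126, 10, 40)   # control group
group_c = rng.normal(123, 10, 40)   # a third group, for ANOVA

# T-test: are the means of two groups significantly different?
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# ANOVA: are the means of three (or more) groups different?
f_stat, p_f = stats.f_oneway(group_a, group_b, group_c)

# Correlation: do two variables move together?
temperature = rng.uniform(15, 35, 60)
ice_cream_sales = 20 * temperature + rng.normal(0, 50, 60)
r, p_r = stats.pearsonr(temperature, ice_cream_sales)

# Regression: model sales as a function of temperature
model = sm.OLS(ice_cream_sales, sm.add_constant(temperature)).fit()

print(f"t-test p = {p_t:.3f}, ANOVA p = {p_f:.3f}, Pearson r = {r:.2f}")
print(model.params)   # intercept and slope
```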

Stats overload…

I hear you. To make this all a little more tangible, let’s take a look at an example of a correlation in action.

Here’s a scatter plot demonstrating the correlation (relationship) between weight and height. Intuitively, we’d expect there to be some relationship between these two variables, which is what we see in this scatter plot. In other words, the results tend to cluster together in a diagonal line from bottom left to top right.

Sample correlation

As I mentioned, these are just a handful of inferential techniques – there are many, many more. Importantly, each statistical method has its own assumptions and limitations.

For example, some methods only work with normally distributed (parametric) data, while other methods are designed specifically for non-parametric data. And that’s exactly why descriptive statistics are so important – they’re the first step to knowing which inferential techniques you can and can’t use.

Remember that every statistical method has its own assumptions and limitations,  so you need to be aware of these.

How to choose the right analysis method

To choose the right statistical methods, you need to think about two important factors :

  • The type of quantitative data you have (specifically, level of measurement and the shape of the data). And,
  • Your research questions and hypotheses

Let’s take a closer look at each of these.

Factor 1 – Data type

The first thing you need to consider is the type of data you’ve collected (or the type of data you will collect). By data types, I’m referring to the four levels of measurement – namely, nominal, ordinal, interval and ratio. If you’re not familiar with this lingo, revisit the section on scales of measurement earlier in this guide.

Why does this matter?

Well, because different statistical methods and techniques require different types of data. This is one of the “assumptions” I mentioned earlier – every method has its assumptions regarding the type of data.

For example, some techniques work with categorical data (for example, yes/no type questions, or gender or ethnicity), while others work with continuous numerical data (for example, age, weight or income) – and, of course, some work with multiple data types.

If you try to use a statistical method that doesn’t support the data type you have, your results will be largely meaningless. So, make sure that you have a clear understanding of what types of data you’ve collected (or will collect). Once you have this, you can then check which statistical methods support those data types.

If you haven’t collected your data yet, you can work in reverse and look at which statistical method would give you the most useful insights, and then design your data collection strategy to collect the correct data types.

Another important factor to consider is the shape of your data . Specifically, does it have a normal distribution (in other words, is it a bell-shaped curve, centred in the middle) or is it very skewed to the left or the right? Again, different statistical techniques work for different shapes of data – some are designed for symmetrical data while others are designed for skewed data.

This is another reminder of why descriptive statistics are so important – they tell you all about the shape of your data.

Factor 2: Your research questions

The next thing you need to consider is your specific research questions, as well as your hypotheses (if you have some). The nature of your research questions and research hypotheses will heavily influence which statistical methods and techniques you should use.

If you’re just interested in understanding the attributes of your sample (as opposed to the entire population), then descriptive statistics are probably all you need. For example, if you just want to assess the means (averages) and medians (centre points) of variables in a group of people.

On the other hand, if you aim to understand differences between groups or relationships between variables and to infer or predict outcomes in the population, then you’ll likely need both descriptive statistics and inferential statistics.

So, it’s really important to get very clear about your research aims and research questions, as well as your hypotheses, before you start looking at which statistical techniques to use.

Never shoehorn a specific statistical technique into your research just because you like it or have some experience with it. Your choice of methods must align with all the factors we’ve covered here.

Time to recap…

You’re still with me? That’s impressive. We’ve covered a lot of ground here, so let’s recap on the key points:

  • Quantitative data analysis is all about  analysing number-based data  (which includes categorical and numerical data) using various statistical techniques.
  • The two main  branches  of statistics are  descriptive statistics  and  inferential statistics . Descriptives describe your sample, whereas inferentials make predictions about what you’ll find in the population.
  • Common  descriptive statistical methods include  mean  (average),  median , standard  deviation  and  skewness .
  • Common  inferential statistical methods include  t-tests ,  ANOVA ,  correlation  and  regression  analysis.
  • To choose the right statistical methods and techniques, you need to consider the  type of data you’re working with , as well as your  research questions  and hypotheses.


Understanding Statistical Analysis: Techniques and Applications

What Is Statistical Analysis?

Statistical analysis is the process of collecting and analyzing data in order to discern patterns and trends. It is a method for removing bias from evaluating data by employing numerical analysis. This technique is useful for collecting the interpretations of research, developing statistical models, and planning surveys and studies.

Statistical analysis is a scientific tool in AI and ML that helps collect and analyze large amounts of data to identify common patterns and trends to convert them into meaningful information. In simple words, statistical analysis is a data analysis tool that helps draw meaningful conclusions from raw and unstructured data. 

The conclusions drawn through statistical analysis facilitate decision-making and help businesses make future predictions on the basis of past trends. Statistical analysis can be defined as the science of collecting and analyzing data to identify trends and patterns and present them meaningfully. It involves working with numbers and is used by businesses and other institutions to turn data into useful information.

Given below are the 6 types of statistical analysis:

Descriptive Analysis

Descriptive statistical analysis involves collecting, interpreting, analyzing, and summarizing data to present them in the form of charts, graphs, and tables. Rather than drawing conclusions, it simply makes the complex data easy to read and understand.

Inferential Analysis

The inferential statistical analysis focuses on drawing meaningful conclusions on the basis of the data analyzed. It studies the relationship between different variables or makes predictions for the whole population.

Predictive Analysis

Predictive statistical analysis is a type of statistical analysis that analyzes data to derive past trends and predict future events on the basis of them. It uses machine learning algorithms, data mining , data modelling , and artificial intelligence to conduct the statistical analysis of data.

Prescriptive Analysis

Prescriptive analysis examines the data and prescribes the best course of action based on the results. It is a type of statistical analysis that helps you make an informed decision.

Exploratory Data Analysis

Exploratory analysis is similar to inferential analysis, but the difference is that it involves exploring the unknown data associations. It analyzes the potential relationships within the data. 

Causal Analysis

The causal statistical analysis focuses on determining the cause and effect relationship between different variables within the raw data. In simple words, it determines why something happens and its effect on other variables. This methodology can be used by businesses to determine the reason for failure. 

Statistical analysis eliminates unnecessary information and catalogs important data in an uncomplicated manner, making the monumental work of organizing inputs far more manageable. Once the data has been collected, statistical analysis may be utilized for a variety of purposes. Some of them are listed below:

  • The statistical analysis aids in summarizing enormous amounts of data into clearly digestible chunks.
  • The statistical analysis aids in the effective design of laboratory, field, and survey investigations.
  • Statistical analysis may help with solid and efficient planning in any subject of study.
  • Statistical analysis aids in establishing broad generalizations and forecasting how much of something will occur under particular conditions.
  • Statistical methods, which are effective tools for interpreting numerical data, are applied in practically every field of study. Statistical approaches have been created and are increasingly applied in physical and biological sciences, such as genetics.
  • Statistical approaches are used in the job of a businessman, a manufacturer, and a researcher. Statistics departments can be found in banks, insurance businesses, and government agencies.
  • A modern administrator, whether in the public or commercial sector, relies on statistical data to make correct decisions.
  • Politicians can utilize statistics to support and validate their claims while also explaining the issues they address.


Statistical analysis can be called a boon to mankind and has many benefits for both individuals and organizations. Given below are some of the reasons why you should consider investing in statistical analysis:

  • It can help you determine the monthly, quarterly and yearly figures for sales, profits and costs, making it easier to make decisions.
  • It can help you make informed and correct decisions.
  • It can help you identify the problem or cause of the failure and make corrections. For example, it can identify the reason for an increase in total costs and help you cut the wasteful expenses.
  • It can help you conduct market research or analysis and make an effective marketing and sales strategy.
  • It helps improve the efficiency of different processes.

Given below are the 5 steps to conduct a statistical analysis that you should follow:

  • Step 1: Identify and describe the nature of the data that you are supposed to analyze.
  • Step 2: The next step is to establish a relation between the data analyzed and the sample population to which the data belongs. 
  • Step 3: The third step is to create a model that clearly presents and summarizes the relationship between the population and the data.
  • Step 4: Prove if the model is valid or not.
  • Step 5: Use predictive analysis to predict future trends and events likely to happen. 

Although there are various methods used to perform data analysis, given below are the 5 most used and popular methods of statistical analysis:

Mean

The mean, or average, is one of the most popular methods of statistical analysis. The mean indicates the overall trend of the data and is very simple to calculate: sum the numbers in the data set and divide by the number of data points. Despite its ease of calculation and its benefits, it is not advisable to rely on the mean as the only statistical indicator, as that can result in inaccurate decision-making.

Standard Deviation

Standard deviation is another very widely used statistical tool or method. It analyzes the deviation of individual data points from the mean of the entire data set, that is, how the data are spread around the mean. You can use it to decide whether the research outcomes can be generalized or not.

Regression

Regression is a statistical tool that models the relationship between a dependent variable and one or more independent variables. It is generally used to predict future trends and events.

Hypothesis Testing

Hypothesis testing can be used to test the validity or trueness of a conclusion or argument against a data set. The hypothesis is an assumption made at the beginning of the research and can hold or be false based on the analysis results. 

Sample Size Determination

Sample size determination or data sampling is a technique used to derive a sample from the entire population, which is representative of the population. This method is used when the size of the population is very large. You can choose from among the various data sampling techniques such as snowball sampling, convenience sampling, and random sampling. 

Not everyone can perform very complex statistical calculations accurately, which makes statistical analysis a time-consuming and costly process when done by hand. Statistical software has therefore become a very important tool for companies performing data analysis. Such software uses artificial intelligence and machine learning to perform complex calculations, identify trends and patterns, and create charts, graphs, and tables accurately within minutes.

Look at the standard deviation sample calculation given below to understand more about statistical analysis.

The weights of 5 pizza bases in cm are as follows: 9, 2, 5, 4 and 12.

Calculation of the mean = (9 + 2 + 5 + 4 + 12) / 5 = 32 / 5 = 6.4

  • 9: deviation from the mean = 9 − 6.4 = 2.6; squared deviation = (2.6)² = 6.76
  • 2: deviation from the mean = 2 − 6.4 = −4.4; squared deviation = (−4.4)² = 19.36
  • 5: deviation from the mean = 5 − 6.4 = −1.4; squared deviation = (−1.4)² = 1.96
  • 4: deviation from the mean = 4 − 6.4 = −2.4; squared deviation = (−2.4)² = 5.76
  • 12: deviation from the mean = 12 − 6.4 = 5.6; squared deviation = (5.6)² = 31.36

Mean of the squared deviations (variance, dividing by n = 5) = (6.76 + 19.36 + 1.96 + 5.76 + 31.36) / 5 = 65.2 / 5 = 13.04

Standard deviation = √13.04 = 3.611
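The same calculation can be checked in a couple of lines of Python with NumPy, using the population formula (dividing by n) as in the worked example above:

```python
import numpy as np

weights = np.array([9, 2, 5, 4, 12])

mean = weights.mean()                        # 6.4
variance = np.mean((weights - mean) ** 2)    # 13.04 (population formula)
std_dev = np.sqrt(variance)                  # ~3.611

print(mean, variance, round(std_dev, 3))
```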

A Statistical Analyst's career path is determined by the industry in which they work. Anyone interested in becoming a Data Analyst may usually enter the profession and qualify for entry-level Data Analyst positions right out of high school or a certificate program — potentially with a Bachelor's degree in statistics, computer science, or mathematics. Some people go into data analysis from a similar sector such as business, economics, or even the social sciences, usually by updating their skills mid-career with a statistical analytics course.

Statistical Analyst is also a great way to get started in the normally more complex area of data science. A Data Scientist is generally a more senior role than a Data Analyst since it is more strategic in nature and necessitates a more highly developed set of technical abilities, such as knowledge of multiple statistical tools, programming languages, and predictive analytics models.

Aspiring Data Scientists and Statistical Analysts generally begin their careers by learning a programming language such as R or SQL. Following that, they must learn how to create databases, do basic analysis, and make visuals using applications such as Tableau. However, not every Statistical Analyst will need to know how to do all of these things, but if you want to advance in your profession, you should be able to do them all.

Based on your industry and the sort of work you do, you may opt to study Python or R , become an expert at data cleaning, or focus on developing complicated statistical models.

You could also learn a little bit of everything, which might help you take on a leadership role and advance to the position of Senior Data Analyst. A Senior Statistical Analyst with vast and deep knowledge might take on a leadership role leading a team of other Statistical Analysts. Statistical Analysts with extra skill training may be able to advance to Data Scientists or other more senior data analytics positions.


Hope this article assisted you in understanding the importance of statistical analysis in every sphere of life. Artificial Intelligence (AI) can help you perform statistical analysis and data analysis very effectively and efficiently. 


What is Statistical Analysis? Types, Methods, Software, Examples

Appinio Research · 29.02.2024 · 31min read


Ever wondered how we make sense of vast amounts of data to make informed decisions? Statistical analysis is the answer. In our data-driven world, statistical analysis serves as a powerful tool to uncover patterns, trends, and relationships hidden within data. From predicting sales trends to assessing the effectiveness of new treatments, statistical analysis empowers us to derive meaningful insights and drive evidence-based decision-making across various fields and industries. In this guide, we'll explore the fundamentals of statistical analysis, popular methods, software tools, practical examples, and best practices to help you harness the power of statistics effectively. Whether you're a novice or an experienced analyst, this guide will equip you with the knowledge and skills to navigate the world of statistical analysis with confidence.

What is Statistical Analysis?

Statistical analysis is a methodical process of collecting, analyzing, interpreting, and presenting data to uncover patterns, trends, and relationships. It involves applying statistical techniques and methodologies to make sense of complex data sets and draw meaningful conclusions.

Importance of Statistical Analysis

Statistical analysis plays a crucial role in various fields and industries due to its numerous benefits and applications:

  • Informed Decision Making : Statistical analysis provides valuable insights that inform decision-making processes in business, healthcare, government, and academia. By analyzing data, organizations can identify trends, assess risks, and optimize strategies for better outcomes.
  • Evidence-Based Research : Statistical analysis is fundamental to scientific research, enabling researchers to test hypotheses, draw conclusions, and validate theories using empirical evidence. It helps researchers quantify relationships, assess the significance of findings, and advance knowledge in their respective fields.
  • Quality Improvement : In manufacturing and quality management, statistical analysis helps identify defects, improve processes, and enhance product quality. Techniques such as Six Sigma and Statistical Process Control (SPC) are used to monitor performance, reduce variation, and achieve quality objectives.
  • Risk Assessment : In finance, insurance, and investment, statistical analysis is used for risk assessment and portfolio management. By analyzing historical data and market trends, analysts can quantify risks, forecast outcomes, and make informed decisions to mitigate financial risks.
  • Predictive Modeling : Statistical analysis enables predictive modeling and forecasting in various domains, including sales forecasting, demand planning, and weather prediction. By analyzing historical data patterns, predictive models can anticipate future trends and outcomes with reasonable accuracy.
  • Healthcare Decision Support : In healthcare, statistical analysis is integral to clinical research, epidemiology, and healthcare management. It helps healthcare professionals assess treatment effectiveness, analyze patient outcomes, and optimize resource allocation for improved patient care.

Statistical Analysis Applications

Statistical analysis finds applications across diverse domains and disciplines, including:

  • Business and Economics : Market research , financial analysis, econometrics, and business intelligence.
  • Healthcare and Medicine : Clinical trials, epidemiological studies, healthcare outcomes research, and disease surveillance.
  • Social Sciences : Survey research, demographic analysis, psychology experiments, and public opinion polls.
  • Engineering : Reliability analysis, quality control, process optimization, and product design.
  • Environmental Science : Environmental monitoring, climate modeling, and ecological research.
  • Education : Educational research, assessment, program evaluation, and learning analytics.
  • Government and Public Policy : Policy analysis, program evaluation, census data analysis, and public administration.
  • Technology and Data Science : Machine learning, artificial intelligence, data mining, and predictive analytics.

These applications demonstrate the versatility and significance of statistical analysis in addressing complex problems and informing decision-making across various sectors and disciplines.

Fundamentals of Statistics

Understanding the fundamentals of statistics is crucial for conducting meaningful analyses. Let's delve into some essential concepts that form the foundation of statistical analysis.

Basic Concepts

Statistics is the science of collecting, organizing, analyzing, and interpreting data to make informed decisions or conclusions. To embark on your statistical journey, familiarize yourself with these fundamental concepts:

  • Population vs. Sample : A population comprises all the individuals or objects of interest in a study, while a sample is a subset of the population selected for analysis. Understanding the distinction between these two entities is vital, as statistical analyses often rely on samples to draw conclusions about populations.
  • Independent Variables : Variables that are manipulated or controlled in an experiment.
  • Dependent Variables : Variables that are observed or measured in response to changes in independent variables.
  • Parameters vs. Statistics : Parameters are numerical measures that describe a population, whereas statistics are numerical measures that describe a sample. For instance, the population mean is denoted by μ (mu), while the sample mean is denoted by x̄ (x-bar).

Descriptive Statistics

Descriptive statistics involve methods for summarizing and describing the features of a dataset. These statistics provide insights into the central tendency, variability, and distribution of the data. Standard measures of descriptive statistics include:

  • Mean : The arithmetic average of a set of values, calculated by summing all values and dividing by the number of observations.
  • Median : The middle value in a sorted list of observations.
  • Mode : The value that appears most frequently in a dataset.
  • Range : The difference between the maximum and minimum values in a dataset.
  • Variance : The average of the squared differences from the mean.
  • Standard Deviation : The square root of the variance, providing a measure of the average distance of data points from the mean.
  • Graphical Techniques : Graphical representations, including histograms, box plots, and scatter plots, offer visual insights into the distribution and relationships within a dataset. These visualizations aid in identifying patterns, outliers, and trends.

Inferential Statistics

Inferential statistics enable researchers to draw conclusions or make predictions about populations based on sample data. These methods allow for generalizations beyond the observed data. Fundamental techniques in inferential statistics include:

  • Hypothesis Testing : A formal procedure for deciding between a null hypothesis (H0), which states that there is no significant difference or relationship, and an alternative hypothesis (H1), which states that there is a significant difference or relationship.
  • Confidence Intervals : Confidence intervals provide a range of plausible values for a population parameter. They offer insights into the precision of sample estimates and the uncertainty associated with those estimates.
  • Regression Analysis : Regression analysis examines the relationship between one or more independent variables and a dependent variable. It allows for the prediction of the dependent variable based on the values of the independent variables.
  • Sampling Methods : Sampling methods, such as simple random sampling, stratified sampling, and cluster sampling , are employed to ensure that sample data are representative of the population of interest. These methods help mitigate biases and improve the generalizability of results.
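As a small illustration of estimating a population parameter from a sample, the sketch below computes a 95% confidence interval for a mean using SciPy (assumed to be installed); the measurements are invented for illustration.

```python
import numpy as np
from scipy import stats

# Invented sample measurements (e.g., reaction times in milliseconds)
sample = np.array([212, 198, 220, 205, 199, 210, 215, 202, 208, 211])

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% confidence interval for the population mean, based on the t distribution
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
print(f"sample mean = {mean:.1f}, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```

A wider interval signals more uncertainty about the population mean; collecting a larger sample generally narrows it.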

Probability Distributions

Probability distributions describe the likelihood of different outcomes in a statistical experiment. Understanding these distributions is essential for modeling and analyzing random phenomena. Some common probability distributions include the following; the sketch after the list evaluates a few example probabilities from each:

  • Normal Distribution : The normal distribution, also known as the Gaussian distribution, is characterized by a symmetric, bell-shaped curve. Many natural phenomena follow this distribution, making it widely applicable in statistical analysis.
  • Binomial Distribution : The binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials. It is commonly used to model binary outcomes, such as success or failure, heads or tails.
  • Poisson Distribution : The Poisson distribution models the number of events occurring in a fixed interval of time or space. It is often used to analyze rare or discrete events, such as the number of customer arrivals in a queue within a given time period.
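The sketch below, which assumes SciPy is available, evaluates a few example probabilities from each of these distributions; the parameter values are arbitrary and chosen only for illustration.

```python
from scipy import stats

# Normal distribution: probability that a standard normal value falls below 1.96
p_norm = stats.norm.cdf(1.96)

# Binomial distribution: probability of exactly 7 successes in 10 trials with p = 0.5
p_binom = stats.binom.pmf(7, 10, 0.5)

# Poisson distribution: probability of 3 events when the average rate is 2 per interval
p_pois = stats.poisson.pmf(3, 2)

print(f"P(Z < 1.96)            = {p_norm:.3f}")   # about 0.975
print(f"P(X = 7 | n=10, p=0.5) = {p_binom:.3f}")
print(f"P(N = 3 | lambda = 2)  = {p_pois:.3f}")
```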

Types of Statistical Analysis

Statistical analysis encompasses a diverse range of methods and approaches, each suited to different types of data and research questions. Understanding the various types of statistical analysis is essential for selecting the most appropriate technique for your analysis. Let's explore some common distinctions in statistical analysis methods.

Parametric vs. Non-parametric Analysis

Parametric and non-parametric analyses represent two broad categories of statistical methods, each with its own assumptions and applications; a short side-by-side comparison in code follows the list below.

  • Parametric Analysis : Parametric methods assume that the data follow a specific probability distribution, often the normal distribution. These methods rely on estimating parameters (e.g., means, variances) from the data. Parametric tests typically provide more statistical power but require stricter assumptions. Examples of parametric tests include t-tests, ANOVA, and linear regression.
  • Non-parametric Analysis : Non-parametric methods make fewer assumptions about the underlying distribution of the data. Instead of estimating parameters, non-parametric tests rely on ranks or other distribution-free techniques. Non-parametric tests are often used when data do not meet the assumptions of parametric tests or when dealing with ordinal or non-normal data. Examples of non-parametric tests include the Wilcoxon rank-sum test, Kruskal-Wallis test, and Spearman correlation.
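To see the contrast in practice, the sketch below (assuming SciPy is installed) runs a parametric two-sample t-test and its non-parametric counterpart, the Wilcoxon rank-sum test, on the same invented data.

```python
from scipy import stats

# Invented scores from two independent groups
group_a = [12.1, 13.4, 11.8, 14.0, 12.9, 13.7, 12.5]
group_b = [10.9, 11.2, 12.0, 10.5, 11.6, 11.1, 10.8]

# Parametric: independent two-sample t-test (assumes roughly normal data)
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Non-parametric: Wilcoxon rank-sum test (compares ranks, no normality assumption)
w_stat, w_p = stats.ranksums(group_a, group_b)

print(f"t-test:            statistic = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Wilcoxon rank-sum: statistic = {w_stat:.2f}, p = {w_p:.4f}")
```

On well-behaved data the two tests usually agree; when the data are skewed or ordinal, the rank-based test is the safer choice.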

Descriptive vs. Inferential Analysis

Descriptive and inferential analyses serve distinct purposes in statistical analysis, focusing on summarizing data and making inferences about populations, respectively.

  • Descriptive Analysis : Descriptive statistics aim to describe and summarize the features of a dataset. These statistics provide insights into the central tendency, variability, and distribution of the data. Descriptive analysis techniques include measures of central tendency (e.g., mean, median, mode), measures of dispersion (e.g., variance, standard deviation), and graphical representations (e.g., histograms, box plots).
  • Inferential Analysis : Inferential statistics involve making inferences or predictions about populations based on sample data. These methods allow researchers to generalize findings from the sample to the larger population. Inferential analysis techniques include hypothesis testing, confidence intervals, regression analysis, and sampling methods. These methods help researchers draw conclusions about population parameters, such as means, proportions, or correlations, based on sample data.

Exploratory vs. Confirmatory Analysis

Exploratory and confirmatory analyses represent two different approaches to data analysis, each serving distinct purposes in the research process.

  • Exploratory Analysis : Exploratory data analysis (EDA) focuses on exploring data to discover patterns, relationships, and trends. EDA techniques involve visualizing data, identifying outliers, and generating hypotheses for further investigation. Exploratory analysis is particularly useful in the early stages of research when the goal is to gain insights and generate hypotheses rather than confirm specific hypotheses.
  • Confirmatory Analysis : Confirmatory data analysis involves testing predefined hypotheses or theories based on prior knowledge or assumptions. Confirmatory analysis follows a structured approach, where hypotheses are tested using appropriate statistical methods. Confirmatory analysis is common in hypothesis-driven research, where the goal is to validate or refute specific hypotheses using empirical evidence. Techniques such as hypothesis testing, regression analysis, and experimental design are often employed in confirmatory analysis.

Methods of Statistical Analysis

Statistical analysis employs various methods to extract insights from data and make informed decisions. Let's explore some of the key methods used in statistical analysis and their applications.

Hypothesis Testing

Hypothesis testing is a fundamental concept in statistics, allowing researchers to make decisions about population parameters based on sample data. The process involves formulating null and alternative hypotheses, selecting an appropriate test statistic, determining the significance level, and interpreting the results. Common hypothesis tests include the following (one of them, the chi-square test, is sketched in code after the list):

  • t-tests : Used to compare means between two groups.
  • ANOVA (Analysis of Variance) : Extends the t-test to compare means across multiple groups.
  • Chi-square test : Assesses the association between categorical variables.
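As one concrete example, the sketch below (assuming SciPy is installed) applies a chi-square test of independence to a small, invented contingency table of two categorical variables.

```python
from scipy.stats import chi2_contingency

# Invented contingency table: rows are two customer segments,
# columns are counts of three product preferences
observed = [
    [30, 10, 20],
    [25, 15, 25],
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}, degrees of freedom = {dof}")
# A small p-value suggests the two categorical variables are associated
```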

Regression Analysis

Regression analysis explores the relationship between one or more independent variables and a dependent variable. It is widely used in predictive modeling and for understanding the impact of variables on outcomes. Key types of regression analysis include the following (a small least-squares sketch follows the list):

  • Simple Linear Regression : Examines the linear relationship between one independent variable and a dependent variable.
  • Multiple Linear Regression : Extends simple linear regression to analyze the relationship between multiple independent variables and a dependent variable.
  • Logistic Regression : Used for predicting binary outcomes or modeling probabilities.
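As a minimal illustration of multiple linear regression, the sketch below fits coefficients by ordinary least squares using NumPy on invented data; in practice, libraries such as statsmodels or scikit-learn are typically used because they also report standard errors and p-values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Invented data: two independent variables and a noisy dependent variable
n = 100
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 3.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(0, 1, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: coefficients that minimize the squared error
coeffs, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = coeffs
print(f"intercept = {intercept:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")  # near 3.0, 2.0, 1.5
```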

Analysis of Variance (ANOVA)

ANOVA is a statistical technique used to compare means across two or more groups. It partitions the total variability in the data into components attributable to different sources, such as between-group differences and within-group variability. ANOVA is commonly used in experimental design and hypothesis testing scenarios.

Time Series Analysis

Time series analysis deals with analyzing data collected or recorded at successive time intervals. It helps identify patterns, trends, and seasonality in the data. Time series analysis techniques include the following (a short decomposition sketch follows the list):

  • Trend Analysis : Identifying long-term trends or patterns in the data.
  • Seasonal Decomposition : Separating the data into seasonal, trend, and residual components.
  • Forecasting : Predicting future values based on historical data.
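The sketch below, which assumes pandas and statsmodels are installed, decomposes a small synthetic monthly series into trend, seasonal, and residual components; the series is fabricated just to show the mechanics.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Fabricated monthly series: upward trend + yearly seasonality + noise
rng = np.random.default_rng(0)
months = pd.date_range("2020-01-01", periods=48, freq="MS")
trend = np.linspace(100, 160, 48)
seasonal = 10 * np.sin(2 * np.pi * np.arange(48) / 12)
noise = rng.normal(0, 2, 48)
series = pd.Series(trend + seasonal + noise, index=months)

# Additive decomposition with a 12-month seasonal period
result = seasonal_decompose(series, model="additive", period=12)
print(result.seasonal.head(12))      # the estimated repeating seasonal pattern
print(result.trend.dropna().head())  # the smoothed long-term trend
```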

Survival Analysis

Survival analysis is used to analyze time-to-event data, such as time until death, failure, or occurrence of an event of interest. It is widely used in medical research, engineering, and social sciences to analyze survival probabilities and hazard rates over time.

Factor Analysis

Factor analysis is a statistical method used to identify underlying factors or latent variables that explain patterns of correlations among observed variables. It is commonly used in psychology, sociology, and market research to uncover underlying dimensions or constructs.

Cluster Analysis

Cluster analysis is a multivariate technique that groups similar objects or observations into clusters or segments based on their characteristics. It is widely used in market segmentation, image processing, and biological classification.

Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving most of the variability in the data. It identifies orthogonal axes (principal components) that capture the maximum variance in the data. PCA is useful for data visualization, feature selection, and data compression.
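A bare-bones PCA can be written with NumPy by centering the data and taking a singular value decomposition; the sketch below projects invented 4-dimensional data onto its first two principal components (libraries such as scikit-learn offer the same functionality with more conveniences).

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented dataset: 200 observations of 4 partially correlated variables
base = rng.normal(size=(200, 2))
X = np.column_stack([
    base[:, 0],
    0.8 * base[:, 0] + 0.2 * rng.normal(size=200),
    base[:, 1],
    -0.5 * base[:, 1] + 0.1 * rng.normal(size=200),
])

X_centered = X - X.mean(axis=0)                            # 1. center each column
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)  # 2. rows of Vt are the principal directions
explained = S**2 / np.sum(S**2)                            # 3. fraction of variance per component
scores = X_centered @ Vt[:2].T                             # 4. project onto the first two components

print("variance explained:", np.round(explained, 3))
print("scores shape:", scores.shape)  # (200, 2)
```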

How to Choose the Right Statistical Analysis Method?

Selecting the appropriate statistical method is crucial for obtaining accurate and meaningful results from your data analysis.

Understanding Data Types and Distribution

Before choosing a statistical method, it's essential to understand the types of data you're working with and their distribution. Different statistical methods are suitable for different types of data:

  • Continuous vs. Categorical Data : Determine whether your data are continuous (e.g., height, weight) or categorical (e.g., gender, race). Parametric methods such as t-tests and regression are typically used for continuous data , while non-parametric methods like chi-square tests are suitable for categorical data.
  • Normality : Assess whether your data follows a normal distribution. Parametric methods often assume normality, so if your data are not normally distributed, non-parametric methods may be more appropriate.

Assessing Assumptions

Many statistical methods rely on certain assumptions about the data. Before applying a method, it's essential to assess whether these assumptions are met; the sketch after this list shows two common checks in code:

  • Independence : Ensure that observations are independent of each other. Violations of independence assumptions can lead to biased results.
  • Homogeneity of Variance : Verify that variances are approximately equal across groups, especially in ANOVA and regression analyses. Levene's test or Bartlett's test can be used to assess homogeneity of variance.
  • Linearity : Check for linear relationships between variables, particularly in regression analysis. Residual plots can help diagnose violations of linearity assumptions.
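The sketch below, assuming SciPy is installed, illustrates two common checks on invented group data: a Shapiro-Wilk test for normality and Levene's test for homogeneity of variance.

```python
from scipy import stats

# Invented measurements from three groups
group1 = [5.1, 4.9, 5.4, 5.0, 5.2, 4.8, 5.3]
group2 = [5.8, 6.1, 5.9, 6.3, 6.0, 5.7, 6.2]
group3 = [5.0, 5.5, 5.2, 5.4, 5.1, 5.3, 5.6]

# Normality check for one group (Shapiro-Wilk); a small p-value suggests non-normal data
shapiro_stat, shapiro_p = stats.shapiro(group1)

# Homogeneity of variance across groups (Levene's test); a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(group1, group2, group3)

print(f"Shapiro-Wilk: statistic = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       statistic = {levene_stat:.3f}, p = {levene_p:.3f}")
```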

Considering Research Objectives

Your research objectives should guide the selection of the appropriate statistical method.

  • What are you trying to achieve with your analysis? : Determine whether you're interested in comparing groups, predicting outcomes, exploring relationships, or identifying patterns.
  • What type of data are you analyzing? : Choose methods that are suitable for your data type and research questions.
  • Are you testing specific hypotheses or exploring data for insights? : Confirmatory analyses involve testing predefined hypotheses, while exploratory analyses focus on discovering patterns or relationships in the data.

Consulting Statistical Experts

If you're unsure about the most appropriate statistical method for your analysis, don't hesitate to seek advice from statistical experts or consultants:

  • Collaborate with Statisticians : Statisticians can provide valuable insights into the strengths and limitations of different statistical methods and help you select the most appropriate approach.
  • Utilize Resources : Take advantage of online resources, forums, and statistical software documentation to learn about different methods and their applications.
  • Peer Review : Consider seeking feedback from colleagues or peers familiar with statistical analysis to validate your approach and ensure rigor in your analysis.

By carefully considering these factors and consulting with experts when needed, you can confidently choose the suitable statistical method to address your research questions and obtain reliable results.

Statistical Analysis Software

Choosing the right software for statistical analysis is crucial for efficiently processing and interpreting your data. In addition to statistical analysis software, it's essential to consider tools for data collection, which lay the foundation for meaningful analysis.

What is Statistical Analysis Software?

Statistical software provides a range of tools and functionalities for data analysis, visualization, and interpretation. These software packages offer user-friendly interfaces and robust analytical capabilities, making them indispensable tools for researchers, analysts, and data scientists.

  • Graphical User Interface (GUI) : Many statistical software packages offer intuitive GUIs that allow users to perform analyses using point-and-click interfaces. This makes statistical analysis accessible to users with varying levels of programming expertise.
  • Scripting and Programming : Advanced users can leverage scripting and programming capabilities within statistical software to automate analyses, customize functions, and extend the software's functionality.
  • Visualization : Statistical software often includes built-in visualization tools for creating charts, graphs, and plots to visualize data distributions, relationships, and trends.
  • Data Management : These software packages provide features for importing, cleaning, and manipulating datasets, ensuring data integrity and consistency throughout the analysis process.

Popular Statistical Analysis Software

Several statistical software packages are widely used in various industries and research domains. Some of the most popular options include:

  • R : R is a free, open-source programming language and software environment for statistical computing and graphics. It offers a vast ecosystem of packages for data manipulation, visualization, and analysis, making it a popular choice among statisticians and data scientists.
  • Python : Python is a versatile programming language with robust libraries like NumPy, SciPy, and pandas for data analysis and scientific computing. Python's simplicity and flexibility make it an attractive option for statistical analysis, particularly for users with programming experience.
  • SPSS : SPSS (Statistical Package for the Social Sciences) is a comprehensive statistical software package widely used in social science research, marketing, and healthcare. It offers a user-friendly interface and a wide range of statistical procedures for data analysis and reporting.
  • SAS : SAS (Statistical Analysis System) is a powerful statistical software suite used for data management, advanced analytics, and predictive modeling. SAS is commonly employed in industries such as healthcare, finance, and government for data-driven decision-making.
  • Stata : Stata is a statistical software package that provides tools for data analysis, manipulation, and visualization. It is popular in academic research, economics, and social sciences for its robust statistical capabilities and ease of use.
  • MATLAB : MATLAB is a high-level programming language and environment for numerical computing and visualization. It offers built-in functions and toolboxes for statistical analysis, machine learning, and signal processing.

Data Collection Software

In addition to statistical analysis software, data collection software plays a crucial role in the research process. These tools facilitate data collection, management, and organization from various sources, ensuring data quality and reliability.

When it comes to data collection, precision and efficiency are paramount. Appinio offers a seamless solution for gathering real-time consumer insights, empowering you to make informed decisions swiftly. With our intuitive platform, you can define your target audience with precision, launch surveys effortlessly, and access valuable data in minutes. Experience the power of Appinio and elevate your data collection process today. Ready to see it in action? Book a demo now!


How to Choose the Right Statistical Analysis Software?

When selecting software for statistical analysis and data collection, consider the following factors:

  • Compatibility : Ensure the software is compatible with your operating system, hardware, and data formats.
  • Usability : Choose software that aligns with your level of expertise and provides features that meet your analysis and data collection requirements.
  • Integration : Consider whether the software integrates with other tools and platforms in your workflow, such as data visualization software or data storage systems.
  • Cost and Licensing : Evaluate the cost of licensing or subscription fees, as well as any additional costs for training, support, or maintenance.

By carefully evaluating these factors and considering your specific analysis and data collection needs, you can select the right software tools to support your research objectives and drive meaningful insights from your data.

Statistical Analysis Examples

Understanding statistical analysis methods is best achieved through practical examples. Let's explore three examples that demonstrate the application of statistical techniques in real-world scenarios.

Example 1: Linear Regression

Scenario : A marketing analyst wants to understand the relationship between advertising spending and sales revenue for a product.

Data : The analyst collects data on monthly advertising expenditures (in dollars) and corresponding sales revenue (in dollars) over the past year.

Analysis : Using simple linear regression, the analyst fits a regression model to the data, where advertising spending is the independent variable (X) and sales revenue is the dependent variable (Y). The regression analysis estimates the linear relationship between advertising spending and sales revenue, allowing the analyst to predict sales based on advertising expenditures.

Result : The regression analysis reveals a statistically significant positive relationship between advertising spending and sales revenue. For every additional dollar spent on advertising, sales revenue increases by an estimated amount (slope coefficient). The analyst can use this information to optimize advertising budgets and forecast sales performance.
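A sketch of how such an analysis might look in code, assuming SciPy and using fabricated advertising and revenue figures rather than any real data:

```python
from scipy import stats

# Fabricated monthly figures: advertising spend and sales revenue, in dollars
ad_spend = [1000, 1500, 1200, 1800, 2000, 1700, 2200, 2500, 2100, 2400, 2600, 2300]
revenue  = [11000, 14500, 12800, 17000, 19500, 16800, 21000, 24500, 20500, 23000, 25500, 22000]

result = stats.linregress(ad_spend, revenue)
print(f"slope     = {result.slope:.2f} (revenue dollars per advertising dollar)")
print(f"intercept = {result.intercept:.0f}, r^2 = {result.rvalue**2:.3f}, p = {result.pvalue:.2g}")

# Predict revenue for a hypothetical $3,000 monthly advertising budget
predicted = result.intercept + result.slope * 3000
print(f"predicted revenue at $3,000 spend: {predicted:,.0f}")
```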

Example 2: Hypothesis Testing

Scenario : A pharmaceutical company develops a new drug intended to lower blood pressure. The company wants to determine whether the new drug is more effective than the existing standard treatment.

Data : The company conducts a randomized controlled trial (RCT) involving two groups of participants: one group receives the new drug, and the other receives the standard treatment. Blood pressure measurements are taken before and after the treatment period.

Analysis : The company uses hypothesis testing, specifically a two-sample t-test, to compare the mean reduction in blood pressure between the two groups. The null hypothesis (H0) states that there is no difference in the mean reduction in blood pressure between the two treatments, while the alternative hypothesis (H1) suggests that the new drug is more effective.

Result : The t-test results indicate a statistically significant difference in the mean reduction in blood pressure between the two groups. The company concludes that the new drug is more effective than the standard treatment in lowering blood pressure, based on the evidence from the RCT.
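A sketch of the comparison, assuming SciPy and using invented blood pressure reductions (in mmHg) in place of real trial data:

```python
from scipy import stats

# Invented reductions in systolic blood pressure (mmHg) after the treatment period
new_drug = [14, 12, 15, 11, 16, 13, 14, 15, 12, 17, 13, 14]
standard = [10, 9, 11, 8, 12, 10, 9, 11, 10, 8, 9, 10]

# Two-sample t-test; Welch's version (equal_var=False) avoids assuming equal variances
t_stat, p_value = stats.ttest_ind(new_drug, standard, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
if p_value < 0.05:
    print("Reject H0: the mean reductions differ between the treatments.")
else:
    print("Fail to reject H0: no evidence of a difference in mean reductions.")
```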

Example 3: ANOVA

Scenario : A researcher wants to compare the effectiveness of three different teaching methods on student performance in a mathematics course.

Data : The researcher conducts an experiment where students are randomly assigned to one of three groups: traditional lecture-based instruction, active learning, or flipped classroom. At the end of the semester, students' scores on a standardized math test are recorded.

Analysis : The researcher performs an analysis of variance (ANOVA) to compare the mean test scores across the three teaching methods. ANOVA assesses whether there are statistically significant differences in mean scores between the groups.

Result : The ANOVA results reveal a significant difference in mean test scores between the three teaching methods. Post-hoc tests, such as Tukey's HSD (Honestly Significant Difference), can be conducted to identify which specific teaching methods differ significantly from each other in terms of student performance.
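A sketch of the analysis, assuming SciPy and made-up test scores; scipy.stats.tukey_hsd (available in recent SciPy releases) handles the post-hoc pairwise comparisons:

```python
from scipy import stats

# Made-up standardized math test scores for the three teaching methods
lecture = [72, 75, 70, 68, 74, 73, 71, 69]
active  = [78, 82, 80, 77, 85, 79, 81, 83]
flipped = [75, 79, 77, 74, 80, 76, 78, 77]

# One-way ANOVA: are the group means all equal?
f_stat, p_value = stats.f_oneway(lecture, active, flipped)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.2g}")

# Post-hoc pairwise comparisons (requires SciPy 1.8 or newer)
tukey = stats.tukey_hsd(lecture, active, flipped)
print(tukey)  # table of pairwise differences, confidence intervals, and p-values
```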

These examples illustrate how statistical analysis techniques can be applied to address various research questions and make data-driven decisions in different fields. By understanding and applying these methods effectively, researchers and analysts can derive valuable insights from their data to inform decision-making and drive positive outcomes.

Statistical Analysis Best Practices

Statistical analysis is a powerful tool for extracting insights from data, but it's essential to follow best practices to ensure the validity, reliability, and interpretability of your results.

  • Clearly Define Research Questions : Before conducting any analysis, clearly define your research questions or objectives . This ensures that your analysis is focused and aligned with the goals of your study.
  • Choose Appropriate Methods : Select statistical methods suitable for your data type, research design , and objectives. Consider factors such as data distribution, sample size, and assumptions of the chosen method.
  • Preprocess Data : Clean and preprocess your data to remove errors, outliers, and missing values. Data preprocessing steps may include data cleaning, normalization, and transformation to ensure data quality and consistency.
  • Check Assumptions : Verify that the assumptions of the chosen statistical methods are met. Assumptions may include normality, homogeneity of variance, independence, and linearity. Conduct diagnostic tests or exploratory data analysis to assess assumptions.
  • Transparent Reporting : Document your analysis procedures, including data preprocessing steps, statistical methods used, and any assumptions made. Transparent reporting enhances reproducibility and allows others to evaluate the validity of your findings.
  • Consider Sample Size : Ensure that your sample size is sufficient to detect meaningful effects or relationships. Power analysis can help determine the minimum sample size required to achieve adequate statistical power.
  • Interpret Results Cautiously : Interpret statistical results with caution and consider the broader context of your research. Be mindful of effect sizes, confidence intervals, and practical significance when interpreting findings.
  • Validate Findings : Validate your findings through robustness checks, sensitivity analyses, or replication studies. Cross-validation and bootstrapping techniques can help assess the stability and generalizability of your results.
  • Avoid P-Hacking and Data Dredging : Guard against p-hacking and data dredging by pre-registering hypotheses, conducting planned analyses, and avoiding selective reporting of results. Maintain transparency and integrity in your analysis process.

By following these best practices, you can conduct rigorous and reliable statistical analyses that yield meaningful insights and contribute to evidence-based decision-making in your field.

Conclusion for Statistical Analysis

Statistical analysis is a vital tool for making sense of data and guiding decision-making across diverse fields. By understanding the fundamentals of statistical analysis, including concepts like hypothesis testing, regression analysis, and data visualization, you gain the ability to extract valuable insights from complex datasets. Moreover, selecting the appropriate statistical methods, choosing the right software, and following best practices ensure the validity and reliability of your analyses. In today's data-driven world, the ability to conduct rigorous statistical analysis is a valuable skill that empowers individuals and organizations to make informed decisions and drive positive outcomes. Whether you're a researcher, analyst, or decision-maker, mastering statistical analysis opens doors to new opportunities for understanding the world around us and unlocking the potential of data to solve real-world problems.

How to Collect Data for Statistical Analysis in Minutes?

Introducing Appinio , your gateway to effortless data collection for statistical analysis. As a real-time market research platform, Appinio specializes in delivering instant consumer insights, empowering businesses to make swift, data-driven decisions.

With Appinio, conducting your own market research is not only feasible but also exhilarating. Here's why:

  • Obtain insights in minutes, not days:  From posing questions to uncovering insights, Appinio accelerates the entire research process, ensuring rapid access to valuable data.
  • User-friendly interface:  No advanced degrees required! Our platform is designed to be intuitive and accessible to anyone, allowing you to dive into market research with confidence.
  • Targeted surveys, global reach:  Define your target audience with precision using our extensive array of demographic and psychographic characteristics, and reach respondents in over 90 countries effortlessly.



Effective Use of Statistics in Research – Methods and Tools for Data Analysis


Remember that sinking feeling you get when you are asked to analyze your data? Now that you have all the required raw data, you need to prove your hypothesis statistically. Presenting your numerical data with sound statistics will also help break the stereotype of the biology student who can't do math.

Statistical methods are essential for scientific research. In fact, they run through the entire research process: planning, designing, collecting data, analyzing, drawing meaningful interpretations, and reporting findings. The results of a research project remain meaningless raw data unless they are analyzed with statistical tools, so choosing appropriate statistics is necessary to justify research findings. In this article, we discuss how statistical methods can help draw meaningful conclusions from biological studies.


Role of Statistics in Biological Research

Statistics is a branch of science that deals with the collection, organization, and analysis of data, from the sample up to the whole population. It helps design a study more meticulously and gives a logical basis for conclusions about the hypothesis. Biology focuses on living organisms and their complex, dynamic pathways, which cannot always be explained by reasoning alone; statistics defines and explains the patterns in a study based on the sample sizes used. In short, statistics reveals the trends in the conducted study.

Biological researchers often disregard statistics during research planning and only turn to statistical tools at the end of their experiment. This gives rise to complicated sets of results that are not easily analyzed. Statistics can instead help a researcher approach the study in a stepwise manner, as follows:

1. Establishing a Sample Size

Usually, a biological experiment starts with choosing samples and selecting the right number of replicate experiments. Basic statistical principles, such as randomization and the law of large numbers, apply here: choosing an adequate sample size from a large random pool helps extrapolate the findings and reduces experimental bias and error.

2. Testing of Hypothesis

When conducting a statistical study with a large sample pool, biological researchers must make sure that their conclusions are statistically significant. To achieve this, a researcher should state a hypothesis before examining the distribution of the data. Statistics then helps interpret whether the data cluster near the mean or spread widely across the distribution, and these patterns are what support or refute the hypothesis.

3. Data Interpretation Through Analysis

When dealing with large datasets, statistics assists in the analysis itself, helping researchers draw sound conclusions from their experiments and observations. Concluding a study manually or by visual inspection alone may give erroneous results; a thorough statistical analysis instead takes all the relevant measures and the variance in the sample into account to provide a detailed interpretation of the data, giving researchers solid evidence to support their conclusions.

Types of Statistical Research Methods That Aid in Data Analysis


Statistical analysis is the process of examining samples of data for patterns or trends that help researchers anticipate situations and draw appropriate research conclusions. Based on the type of data and the purpose of the study, statistical analyses fall into the following types:

1. Descriptive Analysis

Descriptive statistical analysis organizes and summarizes large datasets into graphs and tables. It involves processes such as tabulation, measures of central tendency, measures of dispersion or variance, and skewness measurements.

2. Inferential Analysis

Inferential statistical analysis extrapolates from data acquired from a small sample to the complete population. It helps draw conclusions and make decisions about the whole population on the basis of sample data, and is the recommended approach for research projects that work with a smaller sample but aim to generalize their conclusions to a large population.

3. Predictive Analysis

Predictive analysis is used to forecast future events. It is commonly applied by marketing companies, insurance organizations, online service providers, data-driven marketing teams, and financial corporations.

4. Prescriptive Analysis

Prescriptive analysis examines data to determine what should be done next. It is widely used in business analysis to find the best possible outcome for a situation. It is closely related to descriptive and predictive analysis, but prescriptive analysis goes further by recommending the most appropriate option among the available alternatives.

5. Exploratory Data Analysis

EDA is generally the first step of the data analysis process, conducted before any other statistical analysis technique. It focuses entirely on analyzing patterns in the data to recognize potential relationships, and is used to discover unknown associations, inspect missing data, and obtain maximum insight from the dataset.

6. Causal Analysis

Causal analysis assists in understanding and determining why things happen the way they do. It helps identify the root cause of failures, or simply the underlying reason something could happen; for example, it can be used to understand what will happen to one variable if another variable changes.

7. Mechanistic Analysis

This is the least common type of statistical analysis. Mechanistic analysis is used in big data analytics and the biological sciences. It is based on understanding how individual changes in one variable cause corresponding changes in other variables, while excluding external influences.

Important Statistical Tools In Research

Researchers in the biological field often find statistical analysis the most daunting aspect of completing their research. However, statistical tools can help researchers understand what to do with their data and how to interpret the results, making the process as easy as possible.

1. Statistical Package for Social Science (SPSS)

SPSS is a widely used software package for human behavior research. It can compile descriptive statistics as well as graphical depictions of results, and it includes the option to create scripts that automate analyses or carry out more advanced statistical processing.

2. R Foundation for Statistical Computing

R is used in human behavior research and many other fields. It is a powerful tool, but it has a steep learning curve and requires a certain level of coding. It also comes with an active community engaged in building and enhancing the software and its associated packages.

3. MATLAB (The Mathworks)

MATLAB is an analytical platform and programming language. Researchers and engineers use it to write their own code and answer their research questions. While MATLAB can be difficult for novices, it offers great flexibility in terms of what the researcher needs.

4. Microsoft Excel

MS Excel is not the most powerful solution for statistical analysis in research, but it offers a wide variety of tools for data visualization and simple statistics. It makes it easy to generate summary metrics and customizable graphs and figures, and it is the most accessible option for those just starting out with statistics.

5. Statistical Analysis Software (SAS)

SAS is a statistical platform used in business, healthcare, and human behavior research alike. It can carry out advanced analyses and produce publication-worthy figures, tables, and charts.

6. GraphPad Prism

GraphPad Prism is premium software primarily used by biology researchers, but it is versatile enough to be used in various other fields. Similar to SPSS, GraphPad offers scripting options to automate analyses and carry out complex statistical calculations.

7. Minitab

Minitab offers basic as well as advanced statistical tools for data analysis. Similar to GraphPad and SPSS, it can automate analyses, although this requires some command of coding.

Use of Statistical Tools In Research and Data Analysis

Many biological studies rely on large datasets to analyze trends and patterns, so statistical tools become essential: they manage large data sets and make data processing far more convenient.

Following the steps above will help biological researchers present the statistics in their research in detail, develop accurate hypotheses, and choose the correct tools for them.

A range of statistical tools can help researchers manage their research data and improve their outcomes through better interpretation of the data. Which one you use will depend on your research question, your knowledge of statistics, and your experience with coding.

Have you faced challenges while using statistics in research? How did you manage it? Did you use any of the statistical tools to help you with your research data? Do write to us or comment below!



Comprehensive guidelines for appropriate statistical analysis methods in research

Affiliations.

  • 1 Department of Anesthesiology and Pain Medicine, Daegu Catholic University School of Medicine, Daegu, Korea.
  • 2 Department of Medical Statistics, Daegu Catholic University School of Medicine, Daegu, Korea.
  • PMID: 39210669
  • DOI: 10.4097/kja.24016

Background: The selection of statistical analysis methods in research is a critical and nuanced task that requires a scientific and rational approach. Aligning the chosen method with the specifics of the research design and hypothesis is paramount, as it can significantly impact the reliability and quality of the research outcomes.

Methods: This study explores a comprehensive guideline for systematically choosing appropriate statistical analysis methods, with a particular focus on the statistical hypothesis testing stage and categorization of variables. By providing a detailed examination of these aspects, this study aims to provide researchers with a solid foundation for informed methodological decision making. Moving beyond theoretical considerations, this study delves into the practical realm by examining the null and alternative hypotheses tailored to specific statistical methods of analysis. The dynamic relationship between these hypotheses and statistical methods is thoroughly explored, and a carefully crafted flowchart for selecting the statistical analysis method is proposed.

Results: Based on the flowchart, we examined whether exemplary research papers appropriately used statistical methods that align with the variables chosen and hypotheses built for the research. This iterative process ensures the adaptability and relevance of this flowchart across diverse research contexts, contributing to both theoretical insights and tangible tools for methodological decision-making.

Conclusions: This study emphasizes the importance of a scientific and rational approach for the selection of statistical analysis methods. By providing comprehensive guidelines, insights into the null and alternative hypotheses, and a practical flowchart, this study aims to empower researchers and enhance the overall quality and reliability of scientific studies.

Keywords: Algorithms; Biostatistics; Data analysis; Guideline; Statistical data interpretation; Statistical model.


5 Statistical Analysis Methods for Research and Analysis

Unlocking the value of corporate analytics starts with knowing the right statistical analysis methods. Here are the top five methods for improving business decisions.

It all boils down to using the power of statistical analysis methods, which is how analysts and academics collect data and identify trends and patterns.

Over the last ten years, everyday business has undergone a significant transformation, even if much of it still looks the same on the surface, from the technology used in workspaces to the software used to communicate.

There is now an overwhelming amount of information available that was once rare, and it can feel just as overwhelming if you have no idea how to work through your company's data to find meaningful and accurate insights.

This blog covers five different statistical analysis methods, with a detailed discussion of each.

What is a statistical analysis method?

The practice of gathering and analyzing data to identify patterns and trends is known as statistical analysis . It is a method for eliminating bias from data evaluation by using numerical analysis. Data analytics and data analysis are closely related processes that involve extracting insights from data to make informed decisions.

And these statistical analysis methods are beneficial for gathering research interpretations, creating statistical models, and organizing surveys and studies.

Data analysis employs two basic statistical methods:

  • Descriptive statistics , which use indexes like the mean and median to summarize data, and
  • Inferential statistics , which extrapolate results from data by using statistical tests like the Student's t-test.


The following three factors determine whether a statistical approach is most appropriate:

  • The study’s goal and primary purpose,
  • The kind and dispersion of the data utilized, and
  • The type of observations (Paired/Unpaired).

“Parametric” refers to all types of statistical procedures used to compare means. In contrast, “nonparametric” refers to statistical methods that compare measures other than means, such as medians, mean ranks, and proportions.

For each unique circumstance, statistical analytic methods in biostatistics can be used to analyze and interpret the data. Knowing the assumptions and conditions of the statistical methods is necessary for choosing the best statistical method for data analysis.

Whether you’re a data scientist or not, there’s no doubt that big data is taking the globe by storm. As a result, you must be aware of where to begin. There are 5 options for this statistical analysis method:

Big data is taking over the globe, no matter how you slice it. Mean, more often known as the average, is the initial technique used to conduct the statistical analysis. To find the mean, add a list of numbers, divide that total by the list’s components, and then add another list of numbers.

When this technique is applied, it is possible to quickly view the data while also determining the overall trend of the data collection . The straightforward and quick calculation is also advantageous to the method’s users.

The center of the data under consideration is determined using the statistical mean. The outcome is known as the presented data’s mean. Real-world interactions involving research, education, and athletics frequently use derogatory language. Consider how frequently a baseball player’s batting average—their mean—is brought up in conversation if you consider yourself a data scientist. As a result, you must be aware of where to begin.

Standard deviation

A statistical technique called standard deviation measures how widely distributed the data is from the mean.

When working with data, a high standard deviation indicates that the data is widely dispersed from the mean. A low standard deviation indicates that most of the data sits close to the mean, which can also be thought of as the set's expected value.

Standard deviation is frequently used when analyzing the dispersion of data points—whether or not they are clustered.

Imagine you are a marketer who just finished a customer survey and wants to know whether a larger group of customers would likely give the same responses. After receiving the survey findings, you would assess how dependable the responses are: if the standard deviation is low, the answers can be projected onto a larger group of customers with more confidence.

Regression

Regression in statistics studies the connection between a dependent variable (the outcome you're trying to assess) and one or more independent variables (the data used to predict the dependent variable).

It can also be explained in terms of how one variable influences another: how changes in one variable result in changes in another, or vice versa, in a simple cause-and-effect sense. It suggests that the outcome depends on one or more factors.

Regression analysis graphs and charts employ lines to indicate trends over a predetermined period as well as the strength or weakness of the correlations between the variables.

Hypothesis testing

In statistical analysis, hypothesis testing, often carried out with tests such as the t-test, is used to compare two sets of random variables within the data set.

This approach focuses on determining whether a given claim or conclusion holds for the data collection. It enables a comparison of the data with numerous assumptions and hypotheses. It can also help in predicting how choices will impact the company.

A hypothesis test in statistics evaluates a quantity under a particular assumption. The test's outcome indicates whether the assumption holds or whether it has been violated. This assumption is called the null hypothesis, or hypothesis 0 (H0). Any other hypothesis that would conflict with H0 is called the alternative hypothesis, or hypothesis 1 (H1).

When you perform hypothesis testing, the test’s results are statistically significant if they demonstrate that the event could not have occurred by chance or at random.

Sample size determination

When evaluating data for statistical analysis, gathering truly reliable data can be challenging because the full dataset is often too large to work with. When this is the case, most analysts choose sample size determination , which involves examining a smaller sample of the data.

You must choose the appropriate sample size for accuracy to complete this task effectively. You won’t get reliable results after your analysis if the sample size is too small.

You will use one of several data sampling techniques to achieve this. For example, you might send a survey to your customers and then use simple random sampling to select a subset of customer responses for analysis.

Conversely, excessive sample size can result in time and money lost. You can look at factors like cost, time, or the ease of data collection to decide the sample size.
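As an illustration of calculating rather than guessing a sample size, the sketch below uses the power-analysis utilities in statsmodels (assumed to be installed) to find how many respondents per group are needed for a two-sample comparison; the effect size, significance level, and power values are conventional defaults, not figures from the text above.

```python
from statsmodels.stats.power import TTestIndPower

# Power analysis for a two-sample t-test:
# detect a medium effect (Cohen's d = 0.5) at alpha = 0.05 with 80% power
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)

print(f"Required sample size per group: about {n_per_group:.0f}")  # roughly 64
```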

Are you confused? Don't worry! You can use our sample size calculator.


The ability to think analytically is vital for corporate success. Since data is one of the most important resources available today, using it effectively can result in better outcomes and decision-making.

Regardless of the statistical analysis methods you select, be sure to pay close attention to each potential drawback and its particular formula. No method is right or wrong, and there is no gold standard. It will depend on the information you’ve gathered and the conclusions you hope to draw.

By using QuestionPro, you can make crucial judgments more efficiently while better comprehending your clients and other study subjects. Use the features of the enterprise-grade research suite right away!

The Beginner's Guide to Statistical Analysis | 5 Steps & Examples

Statistical analysis means investigating trends, patterns, and relationships using quantitative data . It is an important research tool used by scientists, governments, businesses, and other organisations.

To draw valid conclusions, statistical analysis requires careful planning from the very start of the research process . You need to specify your hypotheses and make decisions about your research design, sample size, and sampling procedure.

After collecting data from your sample, you can organise and summarise the data using descriptive statistics . Then, you can use inferential statistics to formally test hypotheses and make estimates about the population. Finally, you can interpret and generalise your findings.

This article is a practical introduction to statistical analysis for students and researchers. We’ll walk you through the steps using two research examples. The first investigates a potential cause-and-effect relationship, while the second investigates a potential correlation between variables.

Table of contents

  • Step 1: Write your hypotheses and plan your research design
  • Step 2: Collect data from a sample
  • Step 3: Summarise your data with descriptive statistics
  • Step 4: Test hypotheses or make estimates with inferential statistics
  • Step 5: Interpret your results
  • Frequently asked questions about statistics

Step 1: Write your hypotheses and plan your research design

To collect valid data for statistical analysis, you first need to specify your hypotheses and plan out your research design.

Writing statistical hypotheses

The goal of research is often to investigate a relationship between variables within a population . You start with a prediction, and use statistical analysis to test that prediction.

A statistical hypothesis is a formal way of writing a prediction about a population. Every research prediction is rephrased into null and alternative hypotheses that can be tested using sample data.

While the null hypothesis always predicts no effect or no relationship between variables, the alternative hypothesis states your research prediction of an effect or relationship.

  • Null hypothesis: A 5-minute meditation exercise will have no effect on math test scores in teenagers.
  • Alternative hypothesis: A 5-minute meditation exercise will improve math test scores in teenagers.
  • Null hypothesis: Parental income and GPA have no relationship with each other in college students.
  • Alternative hypothesis: Parental income and GPA are positively correlated in college students.

Planning your research design

A research design is your overall strategy for data collection and analysis. It determines the statistical tests you can use to test your hypothesis later on.

First, decide whether your research will use a descriptive, correlational, or experimental design. Experiments directly influence variables, whereas descriptive and correlational studies only measure variables.

  • In an experimental design , you can assess a cause-and-effect relationship (e.g., the effect of meditation on test scores) using statistical tests of comparison or regression.
  • In a correlational design , you can explore relationships between variables (e.g., parental income and GPA) without any assumption of causality using correlation coefficients and significance tests.
  • In a descriptive design , you can study the characteristics of a population or phenomenon (e.g., the prevalence of anxiety in U.S. college students) using statistical tests to draw inferences from sample data.

Your research design also concerns whether you’ll compare participants at the group level or individual level, or both.

  • In a between-subjects design , you compare the group-level outcomes of participants who have been exposed to different treatments (e.g., those who performed a meditation exercise vs those who didn’t).
  • In a within-subjects design , you compare repeated measures from participants who have participated in all treatments of a study (e.g., scores from before and after performing a meditation exercise).
  • In a mixed (factorial) design , one variable is altered between subjects and another is altered within subjects (e.g., pretest and posttest scores from participants who either did or didn’t do a meditation exercise).
Example: Experimental research design
First, you’ll take baseline test scores from participants. Then, your participants will undergo a 5-minute meditation exercise. Finally, you’ll record participants’ scores from a second math test.

In this experiment, the independent variable is the 5-minute meditation exercise, and the dependent variable is the math test score from before and after the intervention.

Example: Correlational research design
In a correlational study, you test whether there is a relationship between parental income and GPA in graduating college students. To collect your data, you will ask participants to fill in a survey and self-report their parents’ incomes and their own GPA.

Measuring variables

When planning a research design, you should operationalise your variables and decide exactly how you will measure them.

For statistical analysis, it’s important to consider the level of measurement of your variables, which tells you what kind of data they contain:

  • Categorical data represents groupings. These may be nominal (e.g., gender) or ordinal (e.g. level of language ability).
  • Quantitative data represents amounts. These may be on an interval scale (e.g. test score) or a ratio scale (e.g. age).

Many variables can be measured at different levels of precision. For example, age data can be quantitative (8 years old) or categorical (young). If a variable is coded numerically (e.g., level of agreement from 1–5), it doesn’t automatically mean that it’s quantitative instead of categorical.

Identifying the measurement level is important for choosing appropriate statistics and hypothesis tests. For example, you can calculate a mean score with quantitative data, but not with categorical data.

In a research study, along with measures of your variables of interest, you’ll often collect data on relevant participant characteristics.

Variable Type of data
Age Quantitative (ratio)
Gender Categorical (nominal)
Race or ethnicity Categorical (nominal)
Baseline test scores Quantitative (interval)
Final test scores Quantitative (interval)
Parental income Quantitative (ratio)
GPA Quantitative (interval)

Population vs sample

In most cases, it’s too difficult or expensive to collect data from every member of the population you’re interested in studying. Instead, you’ll collect data from a sample.

Statistical analysis allows you to apply your findings beyond your own sample as long as you use appropriate sampling procedures . You should aim for a sample that is representative of the population.

Sampling for statistical analysis

There are two main approaches to selecting a sample.

  • Probability sampling: every member of the population has a chance of being selected for the study through random selection.
  • Non-probability sampling: some members of the population are more likely than others to be selected for the study because of criteria such as convenience or voluntary self-selection.

In theory, for highly generalisable findings, you should use a probability sampling method. Random selection reduces sampling bias and ensures that data from your sample is actually typical of the population. Parametric tests can be used to make strong statistical inferences when data are collected using probability sampling.

But in practice, it’s rarely possible to gather the ideal sample. While non-probability samples are more likely to be biased, they are much easier to recruit and collect data from. Non-parametric tests are more appropriate for non-probability samples, but they result in weaker inferences about the population.

If you want to use parametric tests for non-probability samples, you have to make the case that:

  • your sample is representative of the population you’re generalising your findings to.
  • your sample lacks systematic bias.

Keep in mind that external validity means that you can only generalise your conclusions to others who share the characteristics of your sample. For instance, results from Western, Educated, Industrialised, Rich and Democratic (WEIRD) samples (e.g., college students in the US) aren’t automatically applicable to all non-WEIRD populations.

If you apply parametric tests to data from non-probability samples, be sure to elaborate on the limitations of how far your results can be generalised in your discussion section .

Create an appropriate sampling procedure

Based on the resources available for your research, decide on how you’ll recruit participants.

  • Will you have resources to advertise your study widely, including outside of your university setting?
  • Will you have the means to recruit a diverse sample that represents a broad population?
  • Do you have time to contact and follow up with members of hard-to-reach groups?

Example: Sampling (experimental study)
Your participants are self-selected by their schools. Although you’re using a non-probability sample, you aim for a diverse and representative sample.

Example: Sampling (correlational study)
Your main population of interest is male college students in the US. Using social media advertising, you recruit senior-year male college students from a smaller subpopulation: seven universities in the Boston area.

Calculate sufficient sample size

Before recruiting participants, decide on your sample size either by looking at other studies in your field or using statistics. A sample that’s too small may be unrepresentative of the population, while a sample that’s too large will be more costly than necessary.

There are many sample size calculators online. Different formulas are used depending on whether you have subgroups or how rigorous your study should be (e.g., in clinical research). As a rule of thumb, a minimum of 30 units per subgroup is often recommended.

To use these calculators, you have to understand and input these key components:

  • Significance level (alpha): the risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Statistical power : the probability of your study detecting an effect of a certain size if there is one, usually 80% or higher.
  • Expected effect size : a standardised indication of how large the expected result of your study will be, usually based on other similar studies.
  • Population standard deviation: an estimate of the population parameter based on a previous study or a pilot study of your own.
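If you’d rather compute this in R than use an online calculator, base R’s power.t.test() can solve for the sample size once the other components are supplied. This is only a sketch: the effect size (a 5-point difference with a standard deviation of 10) and the one-sided alternative are illustrative assumptions, not values from the examples above.

# Solve for the required sample size per group in a two-sample t test
power.t.test(delta = 5,            # assumed difference in means (illustrative)
             sd = 10,              # assumed population standard deviation (illustrative)
             sig.level = 0.05,     # significance level (alpha)
             power = 0.80,         # desired statistical power
             type = "two.sample",
             alternative = "one.sided")   # prints the required n per group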

Once you’ve collected all of your data, you can inspect them and calculate descriptive statistics that summarise them.

Inspect your data

There are various ways to inspect your data, including the following:

  • Organising data from each variable in frequency distribution tables .
  • Displaying data from a key variable in a bar chart to view the distribution of responses.
  • Visualising the relationship between two variables using a scatter plot .

By visualising your data in tables and graphs, you can assess whether your data follow a skewed or normal distribution and whether there are any outliers or missing data.

A normal distribution means that your data are symmetrically distributed around a center where most values lie, with the values tapering off at the tail ends.

Mean, median, mode, and standard deviation in a normal distribution

In contrast, a skewed distribution is asymmetric and has more values on one end than the other. The shape of the distribution is important to keep in mind because only some descriptive statistics should be used with skewed distributions.

Extreme outliers can also produce misleading statistics, so you may need a systematic approach to dealing with these values.

Calculate measures of central tendency

Measures of central tendency describe where most of the values in a data set lie. Three main measures of central tendency are often reported:

  • Mode : the most popular response or value in the data set.
  • Median : the value in the exact middle of the data set when ordered from low to high.
  • Mean : the sum of all values divided by the number of values.

However, depending on the shape of the distribution and level of measurement, only one or two of these measures may be appropriate. For example, many demographic characteristics can only be described using the mode or proportions, while a variable like reaction time may not have a mode at all.

Calculate measures of variability

Measures of variability tell you how spread out the values in a data set are. Four main measures of variability are often reported:

  • Range : the highest value minus the lowest value of the data set.
  • Interquartile range : the range of the middle half of the data set.
  • Standard deviation : the average distance between each value in your data set and the mean.
  • Variance : the square of the standard deviation.

Once again, the shape of the distribution and level of measurement should guide your choice of variability statistics. The interquartile range is the best measure for skewed distributions, while standard deviation and variance provide the best information for normal distributions.
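To make these measures concrete, here is a minimal R sketch that computes the common measures of central tendency and variability for a small, made-up vector of test scores (the numbers are purely illustrative).

scores <- c(55, 62, 68, 70, 71, 74, 80, 92)   # illustrative test scores
mean(scores)               # mean
median(scores)             # median
max(scores) - min(scores)  # range
IQR(scores)                # interquartile range
sd(scores)                 # standard deviation
var(scores)                # variance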

Example: Descriptive statistics (experimental study)
Using your table, you should check whether the units of the descriptive statistics are comparable for pretest and posttest scores. For example, are the variance levels similar across the groups? Are there any extreme values? If there are, you may need to identify and remove extreme outliers in your data set or transform your data before performing a statistical test.

Pretest scores Posttest scores
Mean 68.44 75.25
Standard deviation 9.43 9.88
Variance 88.96 97.96
Range 36.25 45.12
Sample size (n) 30

From this table, we can see that the mean score increased after the meditation exercise, and the variances of the two scores are comparable. Next, we can perform a statistical test to find out if this improvement in test scores is statistically significant in the population.

Example: Descriptive statistics (correlational study)
After collecting data from 653 students, you tabulate descriptive statistics for annual parental income and GPA.

It’s important to check whether you have a broad range of data points. If you don’t, your data may be skewed towards some groups more than others (e.g., high academic achievers), and only limited inferences can be made about a relationship.

Parental income (USD) GPA
Mean 62,100 3.12
Standard deviation 15,000 0.45
Variance 225,000,000 0.16
Range 8,000–378,000 2.64–4.00
Sample size (n) 653

A number that describes a sample is called a statistic , while a number describing a population is called a parameter . Using inferential statistics , you can make conclusions about population parameters based on sample statistics.

Researchers often use two main methods (simultaneously) to make inferences in statistics.

  • Estimation: calculating population parameters based on sample statistics.
  • Hypothesis testing: a formal process for testing research predictions about the population using samples.

You can make two types of estimates of population parameters from sample statistics:

  • A point estimate : a value that represents your best guess of the exact parameter.
  • An interval estimate : a range of values that represent your best guess of where the parameter lies.

If your aim is to infer and report population characteristics from sample data, it’s best to use both point and interval estimates in your paper.

You can consider a sample statistic a point estimate for the population parameter when you have a representative sample (e.g., in a wide public opinion poll, the proportion of a sample that supports the current government is taken as the population proportion of government supporters).

There’s always error involved in estimation, so you should also provide a confidence interval as an interval estimate to show the variability around a point estimate.

A confidence interval uses the standard error and the z score from the standard normal distribution to convey where you’d generally expect to find the population parameter most of the time.
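As a rough sketch in R, a 95% confidence interval for a mean can be built from the sample mean, the standard error, and the corresponding z score; the data below are illustrative.

x  <- c(12, 15, 14, 10, 13, 17, 11, 16)   # illustrative sample
se <- sd(x) / sqrt(length(x))             # standard error of the mean
z  <- qnorm(0.975)                        # z score for a 95% confidence level (about 1.96)
mean(x) + c(-1, 1) * z * se               # lower and upper bounds of the interval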

Hypothesis testing

Using data from a sample, you can test hypotheses about relationships between variables in the population. Hypothesis testing starts with the assumption that the null hypothesis is true in the population, and you use statistical tests to assess whether the null hypothesis can be rejected or not.

Statistical tests determine where your sample data would lie on an expected distribution of sample data if the null hypothesis were true. These tests give two main outputs:

  • A test statistic tells you how much your data differs from the null hypothesis of the test.
  • A p value tells you the likelihood of obtaining your results if the null hypothesis is actually true in the population.

Statistical tests come in three main varieties:

  • Comparison tests assess group differences in outcomes.
  • Regression tests assess cause-and-effect relationships between variables.
  • Correlation tests assess relationships between variables without assuming causation.

Your choice of statistical test depends on your research questions, research design, sampling method, and data characteristics.

Parametric tests

Parametric tests make powerful inferences about the population based on sample data. But to use them, some assumptions must be met, and only some types of variables can be used. If your data violate these assumptions, you can perform appropriate data transformations or use alternative non-parametric tests instead.

A regression models the extent to which changes in a predictor variable result in changes in an outcome variable (or variables).

  • A simple linear regression includes one predictor variable and one outcome variable.
  • A multiple linear regression includes two or more predictor variables and one outcome variable.

Comparison tests usually compare the means of groups. These may be the means of different groups within a sample (e.g., a treatment and control group), the means of one sample group taken at different times (e.g., pretest and posttest scores), or a sample mean and a population mean.

  • A t test is for exactly 1 or 2 groups when the sample is small (30 or less).
  • A z test is for exactly 1 or 2 groups when the sample is large.
  • An ANOVA is for 3 or more groups.

The z and t tests have subtypes based on the number and types of samples and the hypotheses:

  • If you have only one sample that you want to compare to a population mean, use a one-sample test .
  • If you have paired measurements (within-subjects design), use a dependent (paired) samples test .
  • If you have completely separate measurements from two unmatched groups (between-subjects design), use an independent (unpaired) samples test .
  • If you expect a difference between groups in a specific direction, use a one-tailed test .
  • If you don’t have any expectations for the direction of a difference between groups, use a two-tailed test .

The only parametric correlation test is Pearson’s r . The correlation coefficient ( r ) tells you the strength of a linear relationship between two quantitative variables.

However, to test whether the correlation in the sample is strong enough to be important in the population, you also need to perform a significance test of the correlation coefficient, usually a t test, to obtain a p value. This test uses your sample size to calculate how much the correlation coefficient differs from zero in the population.

You use a dependent-samples, one-tailed t test to assess whether the meditation exercise significantly improved math test scores. The test gives you:

  • a t value (test statistic) of 3.00
  • a p value of 0.0028
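A test like this could be run in R with t.test(); the pretest and posttest vectors below are illustrative placeholders, not the actual study scores.

pretest  <- c(60, 72, 65, 70, 68, 74)   # illustrative pretest scores
posttest <- c(66, 78, 70, 75, 74, 80)   # illustrative posttest scores
t.test(posttest, pretest,
       paired = TRUE,                   # dependent (paired) samples
       alternative = "greater")         # one-tailed: posttest > pretest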

Although Pearson’s r is a test statistic, it doesn’t tell you anything about how significant the correlation is in the population. You also need to test whether this sample correlation coefficient is large enough to demonstrate a correlation in the population.

A t test can also determine how significantly a correlation coefficient differs from zero based on sample size. Since you expect a positive correlation between parental income and GPA, you use a one-sample, one-tailed t test. The t test gives you:

  • a t value of 3.08
  • a p value of 0.001
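In R, cor.test() returns Pearson’s r and the accompanying t test in one step; the income and gpa vectors below are illustrative placeholders rather than the 653 survey responses.

income <- c(45000, 52000, 61000, 70000, 83000, 90000)   # illustrative parental incomes
gpa    <- c(2.8, 3.0, 3.1, 3.2, 3.5, 3.6)               # illustrative GPAs
cor.test(income, gpa,
         method = "pearson",
         alternative = "greater")   # one-tailed test of a positive correlation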

The final step of statistical analysis is interpreting your results.

Statistical significance

In hypothesis testing, statistical significance is the main criterion for forming conclusions. You compare your p value to a set significance level (usually 0.05) to decide whether your results are statistically significant or non-significant.

Statistically significant results are considered unlikely to have arisen solely due to chance. There is only a very low chance of such a result occurring if the null hypothesis is true in the population.

Example: Interpret your results (experimental study)
You compare your p value of 0.0028 to your significance threshold of 0.05. Since the p value falls below this threshold, you can reject the null hypothesis. This means that you believe the meditation intervention, rather than random factors, directly caused the increase in test scores.

Example: Interpret your results (correlational study)
You compare your p value of 0.001 to your significance threshold of 0.05. With a p value under this threshold, you can reject the null hypothesis. This indicates a statistically significant correlation between parental income and GPA in male college students.

Note that correlation doesn’t always mean causation, because there are often many underlying factors contributing to a complex variable like GPA. Even if one variable is related to another, this may be because of a third variable influencing both of them, or indirect links between the two variables.

Effect size

A statistically significant result doesn’t necessarily mean that there are important real life applications or clinical outcomes for a finding.

In contrast, the effect size indicates the practical significance of your results. It’s important to report effect sizes along with your inferential statistics for a complete picture of your results. You should also report interval estimates of effect sizes if you’re writing an APA style paper .

Example: Effect size (experimental study)
With a Cohen’s d of 0.72, there’s medium to high practical significance to your finding that the meditation exercise improved test scores.

Example: Effect size (correlational study)
To determine the effect size of the correlation coefficient, you compare your Pearson’s r value to Cohen’s effect size criteria.

Decision errors

Type I and Type II errors are mistakes made in research conclusions. A Type I error means rejecting the null hypothesis when it’s actually true, while a Type II error means failing to reject the null hypothesis when it’s false.

You can aim to minimise the risk of these errors by selecting an optimal significance level and ensuring high power . However, there’s a trade-off between the two errors, so a fine balance is necessary.

Frequentist versus Bayesian statistics

Traditionally, frequentist statistics emphasises null hypothesis significance testing and always starts with the assumption of a true null hypothesis.

However, Bayesian statistics has grown in popularity as an alternative approach in the last few decades. In this approach, you use previous research to continually update your hypotheses based on your expectations and observations.

A Bayes factor compares the relative strength of evidence for the null versus the alternative hypothesis, rather than reaching a yes-or-no conclusion about rejecting the null hypothesis.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

The research methods you use depend on the type of data you need to answer your research question .

  • If you want to measure something or test a hypothesis , use quantitative methods . If you want to explore ideas, thoughts, and meanings, use qualitative methods .
  • If you want to analyse a large amount of readily available data, use secondary data. If you want data specific to your purposes with control over how they are generated, collect primary data.
  • If you want to establish cause-and-effect relationships between variables , use experimental methods. If you want to understand the characteristics of a research subject, use descriptive methods.

Statistical analysis is the main method for analyzing quantitative research data . It uses probabilities and models to test predictions about a population from sample data.



Frequently asked questions: Statistics

As the degrees of freedom increase, Student’s t distribution becomes less leptokurtic , meaning that the probability of extreme values decreases. The distribution becomes more and more similar to a standard normal distribution .

The three categories of kurtosis are:

  • Mesokurtosis : An excess kurtosis of 0. Normal distributions are mesokurtic.
  • Platykurtosis : A negative excess kurtosis. Platykurtic distributions are thin-tailed, meaning that they have few outliers .
  • Leptokurtosis : A positive excess kurtosis. Leptokurtic distributions are fat-tailed, meaning that they have many outliers.

Probability distributions belong to two broad categories: discrete probability distributions and continuous probability distributions . Within each category, there are many types of probability distributions.

Probability is the relative frequency over an infinite number of trials.

For example, the probability of a coin landing on heads is .5, meaning that if you flip the coin an infinite number of times, it will land on heads half the time.

Since doing something an infinite number of times is impossible, relative frequency is often used as an estimate of probability. If you flip a coin 1000 times and get 507 heads, the relative frequency, .507, is a good estimate of the probability.

Categorical variables can be described by a frequency distribution. Quantitative variables can also be described by a frequency distribution, but first they need to be grouped into interval classes .

A histogram is an effective way to tell if a frequency distribution appears to have a normal distribution .

Plot a histogram and look at the shape of the bars. If the bars roughly follow a symmetrical bell or hill shape, like the example below, then the distribution is approximately normally distributed.

Frequency distribution approximating a normal distribution - an example histogram

You can use the CHISQ.INV.RT() function to find a chi-square critical value in Excel.

For example, to calculate the chi-square critical value for a test with df = 22 and α = .05, click any blank cell and type:

=CHISQ.INV.RT(0.05,22)

You can use the qchisq() function to find a chi-square critical value in R.

For example, to calculate the chi-square critical value for a test with df = 22 and α = .05:

qchisq(p = .05, df = 22, lower.tail = FALSE)

You can use the chisq.test() function to perform a chi-square test of independence in R. Give the contingency table as a matrix for the “x” argument. For example:

m = matrix(data = c(89, 84, 86, 9, 8, 24), nrow = 3, ncol = 2)

chisq.test(x = m)

You can use the CHISQ.TEST() function to perform a chi-square test of independence in Excel. It takes two arguments, CHISQ.TEST(observed_range, expected_range), and returns the p value.

Chi-square goodness of fit tests are often used in genetics. One common application is to check if two genes are linked (i.e., if the assortment is independent). When genes are linked, the allele inherited for one gene affects the allele inherited for another gene.

Suppose that you want to know if the genes for pea texture (R = round, r = wrinkled) and color (Y = yellow, y = green) are linked. You perform a dihybrid cross between two heterozygous ( RY / ry ) pea plants. The hypotheses you’re testing with your experiment are:

  • Null hypothesis (H0): The offspring have an equal probability of inheriting all possible genotypic combinations. This would suggest that the genes are unlinked.
  • Alternative hypothesis (Ha): The offspring do not have an equal probability of inheriting all possible genotypic combinations. This would suggest that the genes are linked.

You observe 100 peas:

  • 78 round and yellow peas
  • 6 round and green peas
  • 4 wrinkled and yellow peas
  • 12 wrinkled and green peas

Step 1: Calculate the expected frequencies

To calculate the expected values, you can make a Punnett square. If the two genes are unlinked, the probability of each genotypic combination is equal.

Gametes RY ry Ry rY
RY RRYY RrYy RRYy RrYY
ry RrYy rryy Rryy rrYy
Ry RRYy Rryy RRyy RrYy
rY RrYY rrYy RrYy rrYY

The expected phenotypic ratios are therefore 9 round and yellow: 3 round and green: 3 wrinkled and yellow: 1 wrinkled and green.

From this, you can calculate the expected phenotypic frequencies for 100 peas:

Phenotype Observed Expected
Round and yellow 78 100 * (9/16) = 56.25
Round and green 6 100 * (3/16) = 18.75
Wrinkled and yellow 4 100 * (3/16) = 18.75
Wrinkled and green 12 100 * (1/16) = 6.25

Step 2: Calculate chi-square

Phenotype Observed (O) Expected (E) O − E (O − E)² (O − E)² / E
Round and yellow 78 56.25 21.75 473.06 8.41
Round and green 6 18.75 −12.75 162.56 8.67
Wrinkled and yellow 4 18.75 −14.75 217.56 11.6
Wrinkled and green 12 6.25 5.75 33.06 5.29

Χ 2 = 8.41 + 8.67 + 11.6 + 5.29 = 33.97

Step 3: Find the critical chi-square value

Since there are four groups (round and yellow, round and green, wrinkled and yellow, wrinkled and green), there are three degrees of freedom .

For a test of significance at α = .05 and df = 3, the Χ 2 critical value is 7.82.

Step 4: Compare the chi-square value to the critical value

Χ 2 = 33.97

Critical value = 7.82

The Χ 2 value is greater than the critical value .

Step 5: Decide whether to reject the null hypothesis

The Χ 2 value is greater than the critical value, so we reject the null hypothesis that the population of offspring have an equal probability of inheriting all possible genotypic combinations. There is a significant difference between the observed and expected genotypic frequencies ( p < .05).

The data support the alternative hypothesis that the offspring do not have an equal probability of inheriting all possible genotypic combinations, which suggests that the genes are linked.

You can use the chisq.test() function to perform a chi-square goodness of fit test in R. Give the observed values in the “x” argument, give the expected values in the “p” argument, and set “rescale.p” to true. For example:

chisq.test(x = c(22,30,23), p = c(25,25,25), rescale.p = TRUE)
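Applied to the pea example above, the same function can test the observed counts against the expected 9:3:3:1 ratio directly (the proportions already sum to 1, so rescale.p isn’t needed):

chisq.test(x = c(78, 6, 4, 12), p = c(9, 3, 3, 1) / 16)   # X-squared ≈ 33.97, df = 3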

You can use the CHISQ.TEST() function to perform a chi-square goodness of fit test in Excel. It takes two arguments, CHISQ.TEST(observed_range, expected_range), and returns the p value .

Both correlations and chi-square tests can test for relationships between two variables. However, a correlation is used when you have two quantitative variables and a chi-square test of independence is used when you have two categorical variables.

Both chi-square tests and t tests can test for differences between two groups. However, a t test is used when you have a dependent quantitative variable and an independent categorical variable (with two groups). A chi-square test of independence is used when you have two categorical variables.

The two main chi-square tests are the chi-square goodness of fit test and the chi-square test of independence .

A chi-square distribution is a continuous probability distribution . The shape of a chi-square distribution depends on its degrees of freedom , k . The mean of a chi-square distribution is equal to its degrees of freedom ( k ) and the variance is 2 k . The range is 0 to ∞.

As the degrees of freedom ( k ) increases, the chi-square distribution goes from a downward curve to a hump shape. As the degrees of freedom increases further, the hump goes from being strongly right-skewed to being approximately normal.

To find the quartiles of a probability distribution, you can use the distribution’s quantile function.

You can use the quantile() function to find quartiles in R. If your data is called “data”, then “quantile(data, probs=c(.25,.5,.75), type=1)” will return the three quartiles.

You can use the QUARTILE() function to find quartiles in Excel. If your data is in column A, then click any blank cell and type “=QUARTILE(A:A,1)” for the first quartile, “=QUARTILE(A:A,2)” for the second quartile, and “=QUARTILE(A:A,3)” for the third quartile.

You can use the PEARSON() function to calculate the Pearson correlation coefficient in Excel. If your variables are in columns A and B, then click any blank cell and type “=PEARSON(A:A,B:B)”.

There is no function to directly test the significance of the correlation.

You can use the cor() function to calculate the Pearson correlation coefficient in R. To test the significance of the correlation, you can use the cor.test() function.

You should use the Pearson correlation coefficient when (1) the relationship between the variables is linear, (2) both variables are quantitative, (3) both variables are normally distributed, and (4) the data have no outliers.

The Pearson correlation coefficient ( r ) is the most common way of measuring a linear correlation. It is a number between –1 and 1 that measures the strength and direction of the relationship between two variables.

This table summarizes the most important differences between normal distributions and Poisson distributions :

Characteristic Normal Poisson
Type of distribution Continuous Discrete
Parameters Mean (µ) and standard deviation (σ) Lambda (λ)
Shape Bell-shaped Depends on λ
Symmetry Symmetrical Asymmetrical (right-skewed); as λ increases, the asymmetry decreases
Range −∞ to ∞ 0 to ∞

When the mean of a Poisson distribution is large (>10), it can be approximated by a normal distribution.

In the Poisson distribution formula, lambda (λ) is the mean number of events within a given interval of time or space. For example, λ = 0.748 floods per year.

The e in the Poisson distribution formula stands for Euler’s number, approximately 2.718. You can simply substitute e with 2.718 when you’re calculating a Poisson probability. Euler’s number is a very useful constant and is especially important in calculus.
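For reference, both λ and e appear in the Poisson probability mass function, which gives the probability of observing exactly k events in a given interval:

P(X = k) = (λ^k × e^(−λ)) / k!,  for k = 0, 1, 2, …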

The three types of skewness are:

  • Right skew (also called positive skew). A right-skewed distribution is longer on the right side of its peak than on its left.
  • Left skew (also called negative skew). A left-skewed distribution is longer on the left side of its peak than on its right.
  • Zero skew. A distribution with zero skew is symmetrical: its left and right sides are mirror images.

Skewness of a distribution

Skewness and kurtosis are both important measures of a distribution’s shape.

  • Skewness measures the asymmetry of a distribution.
  • Kurtosis measures the heaviness of a distribution’s tails relative to a normal distribution .

Difference between skewness and kurtosis

A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation (“ x affects y because …”).

A statistical hypothesis, on the other hand, is a mathematical statement about a population parameter. Statistical hypotheses always come in pairs: the null and alternative hypotheses . In a well-designed study , the statistical hypotheses correspond logically to the research hypothesis.

The alternative hypothesis is often abbreviated as H a or H 1 . When the alternative hypothesis is written using mathematical symbols, it always includes an inequality symbol (usually ≠, but sometimes < or >).

The null hypothesis is often abbreviated as H 0 . When the null hypothesis is written using mathematical symbols, it always includes an equality symbol (usually =, but sometimes ≥ or ≤).

The t distribution was first described by statistician William Sealy Gosset under the pseudonym “Student.”

To calculate a confidence interval of a mean using the critical value of t , follow these four steps:

  • Choose the significance level based on your desired confidence level. The most common confidence level is 95%, which corresponds to α = .05 in the two-tailed t table .
  • Find the critical value of t in the two-tailed t table.
  • Multiply the critical value of t by s / √ n .
  • Add this value to the mean to calculate the upper limit of the confidence interval, and subtract this value from the mean to calculate the lower limit.
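As a quick sketch, the same four steps can be carried out in R for a 95% confidence interval; x is an illustrative sample.

x      <- c(4.1, 5.0, 4.6, 5.2, 3.9, 4.8, 5.5, 4.4)   # illustrative data
n      <- length(x)
t_crit <- qt(0.975, df = n - 1)      # critical value of t for a two-tailed 95% interval
margin <- t_crit * sd(x) / sqrt(n)   # critical value of t multiplied by s / sqrt(n)
mean(x) + c(-1, 1) * margin          # lower and upper limits of the confidence interval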

To test a hypothesis using the critical value of t , follow these four steps:

  • Calculate the t value for your sample.
  • Find the critical value of t in the t table .
  • Determine if the (absolute) t value is greater than the critical value of t .
  • Reject the null hypothesis if the sample’s t value is greater than the critical value of t . Otherwise, don’t reject the null hypothesis .

You can use the T.INV() function to find the critical value of t for one-tailed tests in Excel, and you can use the T.INV.2T() function for two-tailed tests.

You can use the qt() function to find the critical value of t in R. The function gives the critical value of t for the one-tailed test. If you want the critical value of t for a two-tailed test, divide the significance level by two.

You can use the RSQ() function to calculate R² in Excel. If your dependent variable is in column A and your independent variable is in column B, then click any blank cell and type “=RSQ(A:A,B:B)”.

You can use the summary() function to view the R²  of a linear model in R. You will see the “R-squared” near the bottom of the output.

There are two formulas you can use to calculate the coefficient of determination (R²) of a simple linear regression:

  • R² = r², the square of the Pearson correlation coefficient between the independent and dependent variable.
  • R² = 1 − (RSS / TSS), where RSS is the sum of squared residuals and TSS is the total sum of squares.
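For a simple linear regression, both formulas give the same value, which you can check in R; the x and y vectors below are illustrative.

x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)   # illustrative data
cor(x, y)^2                             # R² as the squared Pearson correlation
summary(lm(y ~ x))$r.squared            # R² from the fitted regression model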

The coefficient of determination (R²) is a number between 0 and 1 that measures how well a statistical model predicts an outcome. You can interpret the R² as the proportion of variation in the dependent variable that is predicted by the statistical model.

There are three main types of missing data .

Missing completely at random (MCAR) data are randomly distributed across the variable and unrelated to other variables .

Missing at random (MAR) data are not randomly distributed but they are accounted for by other observed variables.

Missing not at random (MNAR) data systematically differ from the observed values.

To tidy up your missing data , your options usually include accepting, removing, or recreating the missing data.

  • Acceptance: You leave your data as is
  • Listwise or pairwise deletion: You delete all cases (participants) with missing data from analyses
  • Imputation: You use other data to fill in the missing data

Missing data are important because, depending on the type, they can sometimes bias your results. This means your results may not be generalizable outside of your study because your data come from an unrepresentative sample .

Missing data , or missing values, occur when you don’t have data stored for certain variables or participants.

In any dataset, there’s usually some missing data. In quantitative research , missing values appear as blank cells in your spreadsheet.

There are two steps to calculating the geometric mean :

  • Multiply all values together to get their product.
  • Find the n th root of the product ( n is the number of values).

Before calculating the geometric mean, note that:

  • The geometric mean can only be found for positive values.
  • If any value in the data set is zero, the geometric mean is zero.
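A minimal way to compute the geometric mean in R, assuming all values are positive:

values <- c(4, 9, 16)                # illustrative positive values
prod(values)^(1 / length(values))    # nth root of the product
exp(mean(log(values)))               # equivalent log-based calculation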

The arithmetic mean is the most commonly used type of mean and is often referred to simply as “the mean.” While the arithmetic mean is based on adding and dividing values, the geometric mean multiplies and finds the root of values.

Even though the geometric mean is a less common measure of central tendency , it’s more accurate than the arithmetic mean for percentage change and positively skewed data. The geometric mean is often reported for financial indices and population growth rates.

The geometric mean is an average that multiplies all values and finds a root of the number. For a dataset with n numbers, you find the n th root of their product.

Outliers are extreme values that differ from most values in the dataset. You find outliers at the extreme ends of your dataset.

It’s best to remove outliers only when you have a sound reason for doing so.

Some outliers represent natural variations in the population , and they should be left as is in your dataset. These are called true outliers.

Other outliers are problematic and should be removed because they represent measurement errors , data entry or processing errors, or poor sampling.

You can choose from four main ways to detect outliers :

  • Sorting your values from low to high and checking minimum and maximum values
  • Visualizing your data with a box plot and looking for outliers
  • Using the interquartile range to create fences for your data
  • Using statistical procedures to identify extreme values
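The interquartile range approach, for example, can be sketched in R like this; the 1.5 multiplier is the conventional choice for the fences, and x is illustrative data.

x <- c(2, 3, 3, 4, 5, 5, 6, 7, 40)    # illustrative data with one extreme value
q <- quantile(x, probs = c(0.25, 0.75))
fence_low  <- q[1] - 1.5 * IQR(x)     # lower fence
fence_high <- q[2] + 1.5 * IQR(x)     # upper fence
x[x < fence_low | x > fence_high]     # values flagged as potential outliers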

Outliers can have a big impact on your statistical analyses and skew the results of any hypothesis test if they are inaccurate.

These extreme values can impact your statistical power as well, making it hard to detect a true effect if there is one.

No, the steepness or slope of the line isn’t related to the correlation coefficient value. The correlation coefficient only tells you how closely your data fit on a line, so two datasets with the same correlation coefficient can have very different slopes.

To find the slope of the line, you’ll need to perform a regression analysis .

Correlation coefficients always range between -1 and 1.

The sign of the coefficient tells you the direction of the relationship: a positive value means the variables change together in the same direction, while a negative value means they change together in opposite directions.

The absolute value of a number is equal to the number without its sign. The absolute value of a correlation coefficient tells you the magnitude of the correlation: the greater the absolute value, the stronger the correlation.

These are the assumptions your data must meet if you want to use Pearson’s r :

  • Both variables are on an interval or ratio level of measurement
  • Data from both variables follow normal distributions
  • Your data have no outliers
  • Your data is from a random or representative sample
  • You expect a linear relationship between the two variables

A correlation coefficient is a single number that describes the strength and direction of the relationship between your variables.

Different types of correlation coefficients might be appropriate for your data based on their levels of measurement and distributions . The Pearson product-moment correlation coefficient (Pearson’s r ) is commonly used to assess a linear relationship between two quantitative variables.

There are various ways to improve power:

  • Increase the potential effect size by manipulating your independent variable more strongly,
  • Increase sample size,
  • Increase the significance level (alpha),
  • Reduce measurement error by increasing the precision and accuracy of your measurement devices and procedures,
  • Use a one-tailed test instead of a two-tailed test for t tests and z tests.

A power analysis is a calculation that helps you determine a minimum sample size for your study. It’s made up of four main components. If you know or have estimates for any three of these, you can calculate the fourth component.

  • Statistical power : the likelihood that a test will detect an effect of a certain size if there is one, usually set at 80% or higher.
  • Sample size : the minimum number of observations needed to observe an effect of a certain size with a given power level.
  • Significance level (alpha) : the maximum risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Expected effect size : a standardized way of expressing the magnitude of the expected result of your study, usually based on similar studies or a pilot study.

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

The risk of making a Type II error is inversely related to the statistical power of a test. Power is the extent to which a test can correctly detect a real effect when there is one.

To (indirectly) reduce the risk of a Type II error, you can increase the sample size or the significance level to increase statistical power.

The risk of making a Type I error is the significance level (or alpha) that you choose. That’s a value that you set at the beginning of your study to assess the statistical probability of obtaining your results ( p value ).

The significance level is usually set at 0.05 or 5%. This means that, if the null hypothesis were actually true, results at least as extreme as yours would have only a 5% chance (or less) of occurring.

To reduce the Type I error probability, you can set a lower significance level.

In statistics, a Type I error means rejecting the null hypothesis when it’s actually true, while a Type II error means failing to reject the null hypothesis when it’s actually false.

In statistics, power refers to the likelihood of a hypothesis test detecting a true effect if there is one. A statistically powerful test is less likely to produce a false negative (a Type II error).

If you don’t ensure enough power in your study, you may not be able to detect a statistically significant result even when it has practical significance. Your study might not have the ability to answer your research question.

While statistical significance shows that an effect exists in a study, practical significance shows that the effect is large enough to be meaningful in the real world.

Statistical significance is denoted by p -values whereas practical significance is represented by effect sizes .

There are dozens of measures of effect sizes . The most common effect sizes are Cohen’s d and Pearson’s r . Cohen’s d measures the size of the difference between two groups while Pearson’s r measures the strength of the relationship between two variables .

Effect size tells you how meaningful the relationship between variables or the difference between groups is.

A large effect size means that a research finding has practical significance, while a small effect size indicates limited practical applications.

Using descriptive and inferential statistics , you can make two types of estimates about the population : point estimates and interval estimates.

  • A point estimate is a single value estimate of a parameter . For instance, a sample mean is a point estimate of a population mean.
  • An interval estimate gives you a range of values where the parameter is expected to lie. A confidence interval is the most common type of interval estimate.

Both types of estimates are important for gathering a clear idea of where a parameter is likely to lie.

Standard error and standard deviation are both measures of variability . The standard deviation reflects variability within a sample, while the standard error estimates the variability across samples of a population.

The standard error of the mean , or simply standard error , indicates how different the population mean is likely to be from a sample mean. It tells you how much the sample mean would vary if you were to repeat a study using new samples from within a single population.

To figure out whether a given number is a parameter or a statistic , ask yourself the following:

  • Does the number describe a whole, complete population where every member can be reached for data collection ?
  • Is it possible to collect data for this number from every member of the population in a reasonable time frame?

If the answer is yes to both questions, the number is likely to be a parameter. For small populations, data can be collected from the whole population and summarized in parameters.

If the answer is no to either of the questions, then the number is more likely to be a statistic.

The arithmetic mean is the most commonly used mean. It’s often simply called the mean or the average. But there are some other types of means you can calculate depending on your research purposes:

  • Weighted mean: some values contribute more to the mean than others.
  • Geometric mean : values are multiplied rather than summed up.
  • Harmonic mean: reciprocals of values are used instead of the values themselves.

You can find the mean , or average, of a data set in two simple steps:

  • Find the sum of the values by adding them all up.
  • Divide the sum by the number of values in the data set.

This method is the same whether you are dealing with sample or population data or positive or negative numbers.

The median is the most informative measure of central tendency for skewed distributions or distributions with outliers. For example, the median is often used as a measure of central tendency for income distributions, which are generally highly skewed.

Because the median only uses one or two values, it’s unaffected by extreme outliers or non-symmetric distributions of scores. In contrast, the mean and mode can vary in skewed distributions.

To find the median, first order your data from low to high. Then calculate the middle position based on n, the number of values in your data set:

Middle position = (n + 1) / 2

For example, with n = 7 values, the median is the value in the 4th position; with an even n, the median is the average of the two values on either side of this position.

A data set can often have no mode, one mode or more than one mode – it all depends on how many different values repeat most frequently.

Your data can be:

  • without any mode
  • unimodal, with one mode,
  • bimodal, with two modes,
  • trimodal, with three modes, or
  • multimodal, with four or more modes.

To find the mode :

  • If your data is numerical or quantitative, order the values from low to high.
  • If it is categorical, sort the values by group, in any order.

Then you simply need to identify the most frequently occurring value.

The interquartile range is the best measure of variability for skewed distributions or data sets with outliers. Because it’s based on values that come from the middle half of the distribution, it’s unlikely to be influenced by outliers .

The two most common methods for calculating interquartile range are the exclusive and inclusive methods.

The exclusive method excludes the median when identifying Q1 and Q3, while the inclusive method includes the median as a value in the data set in identifying the quartiles.

For each of these methods, you’ll need different procedures for finding the median, Q1 and Q3 depending on whether your sample size is even- or odd-numbered. The exclusive method works best for even-numbered sample sizes, while the inclusive method is often used with odd-numbered sample sizes.

While the range gives you the spread of the whole data set, the interquartile range gives you the spread of the middle half of a data set.

Homoscedasticity, or homogeneity of variances, is an assumption of equal or similar variances in different groups being compared.

This is an important assumption of parametric statistical tests because they are sensitive to any dissimilarities. Uneven variances in samples result in biased and skewed test results.

Statistical tests such as variance tests or the analysis of variance (ANOVA) use sample variance to assess group differences of populations. They use the variances of the samples to assess whether the populations they come from significantly differ from each other.

Variance is the average of the squared deviations from the mean, while standard deviation is the square root of this number. Both measures reflect variability in a distribution, but their units differ:

  • Standard deviation is expressed in the same units as the original values (e.g., minutes or meters).
  • Variance is expressed in squared units (e.g., minutes squared or meters squared).

Although the units of variance are harder to intuitively understand, variance is important in statistical tests .

The empirical rule, or the 68-95-99.7 rule, tells you where most of the values lie in a normal distribution :

  • Around 68% of values are within 1 standard deviation of the mean.
  • Around 95% of values are within 2 standard deviations of the mean.
  • Around 99.7% of values are within 3 standard deviations of the mean.

The empirical rule is a quick way to get an overview of your data and check for any outliers or extreme values that don’t follow this pattern.
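You can check these percentages against the standard normal distribution in R with pnorm():

pnorm(1) - pnorm(-1)   # about 0.68, within 1 standard deviation
pnorm(2) - pnorm(-2)   # about 0.95, within 2 standard deviations
pnorm(3) - pnorm(-3)   # about 0.997, within 3 standard deviations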

In a normal distribution , data are symmetrically distributed with no skew. Most values cluster around a central region, with values tapering off as they go further away from the center.

The measures of central tendency (mean, mode, and median) are exactly the same in a normal distribution.

Normal distribution

The standard deviation is the average amount of variability in your data set. It tells you, on average, how far each score lies from the mean .

In normal distributions, a high standard deviation means that values are generally far from the mean, while a low standard deviation indicates that values are clustered close to the mean.

No. Because the range formula subtracts the lowest number from the highest number, the range is always zero or a positive number.

In statistics, the range is the spread of your data from the lowest to the highest value in the distribution. It is the simplest measure of variability .

While central tendency tells you where most of your data points lie, variability summarizes how far apart your data points are from each other.

Data sets can have the same central tendency but different levels of variability, or vice versa. Together, they give you a complete picture of your data.

Variability is most commonly measured with the following descriptive statistics :

  • Range : the difference between the highest and lowest values
  • Interquartile range : the range of the middle half of a distribution
  • Standard deviation : average distance from the mean
  • Variance : average of squared distances from the mean

Variability tells you how far apart points lie from each other and from the center of a distribution or a data set.

Variability is also referred to as spread, scatter or dispersion.

While interval and ratio data can both be categorized, ranked, and have equal spacing between adjacent values, only ratio scales have a true zero.

For example, temperature in Celsius or Fahrenheit is at an interval scale because zero is not the lowest possible temperature. In the Kelvin scale, a ratio scale, zero represents a total lack of thermal energy.

A critical value is the value of the test statistic which defines the upper and lower bounds of a confidence interval , or which defines the threshold of statistical significance in a statistical test. It describes how far from the mean of the distribution you have to go to cover a certain amount of the total variation in the data (i.e. 90%, 95%, 99%).

If you are constructing a 95% confidence interval and are using a threshold of statistical significance of p = 0.05, then your critical value will be identical in both cases.

The t -distribution gives more probability to observations in the tails of the distribution than the standard normal distribution (a.k.a. the z -distribution).

In this way, the t -distribution is more conservative than the standard normal distribution: to reach the same level of confidence or statistical significance , you will need to include a wider range of the data.

A t -score (a.k.a. a t -value) is equivalent to the number of standard deviations away from the mean of the t -distribution .

The t -score is the test statistic used in t -tests and regression tests. It can also be used to describe how far from the mean an observation is when the data follow a t -distribution.

The t -distribution is a way of describing a set of observations where most observations fall close to the mean , and the rest of the observations make up the tails on either side. It is similar in shape to the normal distribution but has heavier tails, and it is used for smaller sample sizes, where the variance in the data is unknown.

The t -distribution forms a bell curve when plotted on a graph. It can be described mathematically using the mean and the standard deviation .

In statistics, ordinal and nominal variables are both considered categorical variables .

Even though ordinal data can sometimes be numerical, not all mathematical operations can be performed on them.

Ordinal data has two characteristics:

  • The data can be classified into different categories within a variable.
  • The categories have a natural ranked order.

However, unlike with interval data, the distances between the categories are uneven or unknown.

Nominal and ordinal are two of the four levels of measurement . Nominal level data can only be classified, while ordinal level data can be classified and ordered.

Nominal data is data that can be labelled or classified into mutually exclusive categories within a variable. These categories cannot be ordered in a meaningful way.

For example, for the nominal variable of preferred mode of transportation, you may have the categories of car, bus, train, tram or bicycle.

If your confidence interval for a difference between groups includes zero, that means that if you run your experiment again you have a good chance of finding no difference between groups.

If your confidence interval for a correlation or regression includes zero, that means that if you run your experiment again there is a good chance of finding no correlation in your data.

In both of these cases, you will also find a high p -value when you run your statistical test, meaning that your results could have occurred under the null hypothesis of no relationship between variables or no difference between groups.

If you want to calculate a confidence interval around the mean of data that is not normally distributed , you have two choices:

  • Find a distribution that matches the shape of your data and use that distribution to calculate the confidence interval.
  • Perform a transformation on your data to make it fit a normal distribution, and then find the confidence interval for the transformed data.

The standard normal distribution , also called the z -distribution, is a special normal distribution where the mean is 0 and the standard deviation is 1.

Any normal distribution can be converted into the standard normal distribution by turning the individual values into z -scores. In a z -distribution, z -scores tell you how many standard deviations away from the mean each value lies.
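In R, the scale() function performs this conversion, subtracting the mean and dividing by the standard deviation; the values below are illustrative.

x <- c(10, 12, 15, 18, 20)   # illustrative values
scale(x)                     # z-scores: (x - mean(x)) / sd(x)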

The z -score and t -score (aka z -value and t -value) show how many standard deviations away from the mean of the distribution you are, assuming your data follow a z -distribution or a t -distribution .

These scores are used in statistical tests to show how far from the mean of the predicted distribution your statistical estimate is. If your test produces a z -score of 2.5, this means that your estimate is 2.5 standard deviations from the predicted mean.

The predicted mean and distribution of your estimate are generated by the null hypothesis of the statistical test you are using. The more standard deviations away from the predicted mean your estimate is, the less likely it is that the estimate could have occurred under the null hypothesis .
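
As a small illustration with made-up numbers, the Python snippet below converts raw values to z-scores and notes how a test statistic of 2.5 would be read as 2.5 standard deviations from the predicted mean.

```python
import numpy as np

# Hypothetical sample of raw values
values = np.array([12.0, 15.5, 9.8, 14.2, 11.1, 13.7])

# Convert each value to a z-score: how many standard deviations from the mean it lies
z_scores = (values - values.mean()) / values.std(ddof=1)
print(np.round(z_scores, 2))

# A test statistic of 2.5 means the estimate sits 2.5 standard deviations
# above the mean of the distribution predicted by the null hypothesis.
```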

To calculate the confidence interval , you need to know:

  • The point estimate you are constructing the confidence interval for
  • The critical values for the test statistic
  • The standard deviation of the sample
  • The sample size

Then you can plug these components into the confidence interval formula that corresponds to your data. The formula depends on the type of estimate (e.g. a mean or a proportion) and on the distribution of your data.

The confidence level is the percentage of times you expect to get close to the same estimate if you run your experiment again or resample the population in the same way.

The confidence interval consists of the upper and lower bounds of the estimate you expect to find at a given level of confidence.

For example, if you are estimating a 95% confidence interval around the mean proportion of female babies born every year based on a random sample of babies, you might find an upper bound of 0.56 and a lower bound of 0.48. These are the upper and lower bounds of the confidence interval. The confidence level is 95%.
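
To make the pieces concrete, here is a minimal Python sketch (the sample size and proportion are invented for illustration, not taken from any real birth data) that combines a point estimate, a critical value, the standard error, and the sample size into a 95% confidence interval for a proportion.

```python
import math
from scipy import stats

p_hat = 0.52      # hypothetical point estimate (sample proportion)
n = 1000          # hypothetical sample size
confidence = 0.95

# Critical value for the chosen confidence level
z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)

# Standard error of a proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)

lower = p_hat - z_crit * se
upper = p_hat + z_crit * se
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```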

The mean is the most frequently used measure of central tendency because it uses all values in the data set to give you an average.

For data from skewed distributions, the median is better than the mean because it isn’t influenced by extremely large values.

The mode is the only measure you can use for nominal or categorical data that can’t be ordered.

The measures of central tendency you can use depend on the level of measurement of your data.

  • For a nominal level, you can only use the mode to find the most frequent value.
  • For an ordinal level or ranked data, you can also use the median to find the value in the middle of your data set.
  • For interval or ratio levels, in addition to the mode and median, you can use the mean to find the average value.

Measures of central tendency help you find the middle, or the average, of a data set.

The 3 most common measures of central tendency are the mean, median and mode.

  • The mode is the most frequent value.
  • The median is the middle number in an ordered data set.
  • The mean is the sum of all values divided by the total number of values.
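
A quick Python sketch with made-up ages shows how the three measures are computed (note that Python's statistics.mode returns a single most frequent value; other tools may report ties differently).

```python
import statistics

ages = [22, 22, 23, 25, 20, 22, 28, 24, 21, 22]   # hypothetical data

print("Mean:  ", statistics.mean(ages))    # sum of values / number of values
print("Median:", statistics.median(ages))  # middle value of the sorted list
print("Mode:  ", statistics.mode(ages))    # most frequent value (22)
```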

Some variables have fixed levels. For example, gender and ethnicity are always nominal level data because they cannot be ranked.

However, for other variables, you can choose the level of measurement . For example, income is a variable that can be recorded on an ordinal or a ratio scale:

  • At an ordinal level , you could create 5 income groupings and code the incomes that fall within them from 1–5.
  • At a ratio level , you would record exact numbers for income.

If you have a choice, the ratio level is always preferable because you can analyze data in more ways. The higher the level of measurement, the more precise your data is.

The level at which you measure a variable determines how you can analyze your data.

Depending on the level of measurement , you can perform different descriptive statistics to get an overall summary of your data and inferential statistics to see if your results support or refute your hypothesis .

Levels of measurement tell you how precisely variables are recorded. There are 4 levels of measurement, which can be ranked from low to high:

  • Nominal : the data can only be categorized.
  • Ordinal : the data can be categorized and ranked.
  • Interval : the data can be categorized and ranked, and evenly spaced.
  • Ratio : the data can be categorized, ranked, evenly spaced and has a natural zero.

The p -value only tells you how likely the data you have observed are to have occurred under the null hypothesis .

If the p -value is below your threshold of significance (typically p < 0.05), then you can reject the null hypothesis, but this does not necessarily mean that your alternative hypothesis is true.

The alpha value, or the threshold for statistical significance , is arbitrary – which value you use depends on your field of study.

In most cases, researchers use an alpha of 0.05, which means that a result is treated as significant only if data at least as extreme would occur less than 5% of the time under the null hypothesis.

P -values are usually automatically calculated by the program you use to perform your statistical test. They can also be estimated using p -value tables for the relevant test statistic .

P -values are calculated from the null distribution of the test statistic. They tell you how often a test statistic is expected to occur under the null hypothesis of the statistical test, based on where it falls in the null distribution.

If the test statistic is far from the mean of the null distribution, then the p -value will be small, showing that the test statistic is not likely to have occurred under the null hypothesis.

A p -value , or probability value, is a number describing how likely it is that your data would have occurred under the null hypothesis of your statistical test .
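
For instance, here is a short Python sketch (with an assumed t statistic of 2.1 and 18 degrees of freedom) showing how a two-tailed p-value is read off the null distribution of the test statistic.

```python
from scipy import stats

t_stat = 2.1   # assumed test statistic
df = 18        # assumed degrees of freedom

# Two-tailed p-value: probability of a statistic at least this extreme under the null
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(f"p = {p_value:.3f}")   # roughly 0.05
```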

The test statistic you use will be determined by the statistical test.

You can choose the right statistical test by looking at what type of data you have collected and what type of relationship you want to test.

The test statistic will change based on the number of observations in your data, how variable your observations are, and how strong the underlying patterns in the data are.

For example, if one data set has higher variability than another, the noisier data set will produce a test statistic closer to the value expected under the null hypothesis , even if the true correlation between the two variables is the same in both data sets.

The formula for the test statistic depends on the statistical test being used.

Generally, the test statistic is calculated as the pattern in your data (i.e. the correlation between variables or the difference between groups) divided by the variability in the data (e.g. the standard deviation or standard error).

  • Univariate statistics summarize only one variable  at a time.
  • Bivariate statistics compare two variables .
  • Multivariate statistics compare more than two variables .

The 3 main types of descriptive statistics concern the frequency distribution, central tendency, and variability of a dataset.

  • Distribution refers to the frequencies of different responses.
  • Measures of central tendency give you the middle, or average, of your data set.
  • Measures of variability show you the spread or dispersion of your dataset.

Descriptive statistics summarize the characteristics of a data set. Inferential statistics allow you to test a hypothesis or assess whether your data is generalizable to the broader population.

In statistics, model selection is a process researchers use to compare the relative value of different statistical models and determine which one is the best fit for the observed data.

The Akaike information criterion is one of the most common methods of model selection. AIC weights the ability of the model to predict the observed data against the number of parameters the model requires to reach that level of precision.

AIC model selection can help researchers find a model that explains the observed variation in their data while avoiding overfitting.

In statistics, a model is the collection of one or more independent variables and their predicted interactions that researchers use to try to explain variation in their dependent variable.

You can test a model using a statistical test . To compare how well different models fit your data, you can use Akaike’s information criterion for model selection.

The Akaike information criterion is calculated from the maximum log-likelihood of the model and the number of parameters (K) used to reach that likelihood. The AIC function is 2K – 2(log-likelihood) .

Lower AIC values indicate a better-fitting model. By convention, a model is considered meaningfully better than another if its AIC is lower by more than 2 (that is, the delta-AIC between the two models is greater than 2).

The Akaike information criterion is a mathematical test used to evaluate how well a model fits the data it is meant to describe. It penalizes models which use more independent variables (parameters) as a way to avoid over-fitting.

AIC is most often used to compare the relative goodness-of-fit among different models under consideration and to then choose the model that best fits the data.
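
As a worked illustration (the log-likelihoods and parameter counts below are invented, not results from a fitted model), this Python sketch applies the AIC formula to two candidate models and compares them via their delta-AIC.

```python
def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion: 2K - 2(log-likelihood)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical maximum log-likelihoods and numbers of parameters
model_simple = aic(log_likelihood=-120.0, k=3)
model_complex = aic(log_likelihood=-118.5, k=6)

print("AIC (simple): ", model_simple)    # 246.0
print("AIC (complex):", model_complex)   # 249.0
print("delta-AIC:", model_complex - model_simple)
# The simpler model has the lower AIC here: the small gain in likelihood
# does not justify the three extra parameters.
```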

A factorial ANOVA is any ANOVA that uses more than one categorical independent variable . A two-way ANOVA is a type of factorial ANOVA.

Some examples of factorial ANOVAs include:

  • Testing the combined effects of vaccination (vaccinated or not vaccinated) and health status (healthy or pre-existing condition) on the rate of flu infection in a population.
  • Testing the effects of marital status (married, single, divorced, widowed), job status (employed, self-employed, unemployed, retired), and family history (no family history, some family history) on the incidence of depression in a population.
  • Testing the effects of feed type (type A, B, or C) and barn crowding (not crowded, somewhat crowded, very crowded) on the final weight of chickens in a commercial farming operation.

In ANOVA, the null hypothesis is that there is no difference among group means. If any group differs significantly from the overall group mean, then the ANOVA will report a statistically significant result.

Significant differences among group means are calculated using the F statistic, which is the ratio of the mean sum of squares (the variance explained by the independent variable) to the mean square error (the variance left over).

If the F statistic is higher than the critical value (the value of F that corresponds with your alpha value, usually 0.05), then the difference among groups is deemed statistically significant.
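
The following Python sketch (with fabricated group measurements) runs a one-way ANOVA using scipy and compares the resulting F statistic with the critical value at alpha = 0.05.

```python
from scipy import stats

# Hypothetical measurements for three groups
group_a = [24.1, 25.3, 23.8, 26.0, 24.7]
group_b = [27.2, 26.8, 28.1, 27.5, 26.9]
group_c = [24.9, 25.6, 25.1, 24.4, 25.8]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

# Critical value of F at alpha = 0.05 with (groups - 1, N - groups) degrees of freedom
df_between, df_within = 2, 12
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f"F = {f_stat:.2f}, p = {p_value:.4f}, critical F = {f_crit:.2f}")
# If F exceeds the critical value (equivalently, p < 0.05),
# the difference among group means is deemed statistically significant.
```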

The only difference between one-way and two-way ANOVA is the number of independent variables . A one-way ANOVA has one independent variable, while a two-way ANOVA has two.

  • One-way ANOVA : Testing the relationship between shoe brand (Nike, Adidas, Saucony, Hoka) and race finish times in a marathon.
  • Two-way ANOVA : Testing the relationship between shoe brand (Nike, Adidas, Saucony, Hoka), runner age group (junior, senior, master’s), and race finishing times in a marathon.

All ANOVAs are designed to test for differences among three or more groups. If you are only testing for a difference between two groups, use a t-test instead.

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a linear equation (a plane or hyperplane rather than a single straight line).

Linear regression most often uses mean-square error (MSE) to calculate the error of the model. MSE is calculated by:

  • measuring the distance of the observed y-values from the predicted y-values at each value of x;
  • squaring each of these distances;
  • calculating the mean of the squared distances.

Linear regression fits a line to the data by finding the regression coefficient that results in the smallest MSE.
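
A minimal numpy sketch of these steps, using made-up x and y values: fit a least-squares line, then compute the MSE from the squared distances between observed and predicted y-values.

```python
import numpy as np

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares fit of a straight line y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

# Mean-square error: mean of the squared distances between observed and predicted y
mse = np.mean((y - y_pred) ** 2)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, MSE={mse:.3f}")
```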

Simple linear regression is a regression model that estimates the relationship between one independent variable and one dependent variable using a straight line. Both variables should be quantitative.

For example, the relationship between temperature and the expansion of mercury in a thermometer can be modeled using a straight line: as temperature increases, the mercury expands. This linear relationship is so certain that we can use mercury thermometers to measure temperature.

A regression model is a statistical model that estimates the relationship between one dependent variable and one or more independent variables using a line (or a plane in the case of two or more independent variables).

A regression model can be used when the dependent variable is quantitative, except in the case of logistic regression, where the dependent variable is binary.

A t-test should not be used to measure differences among more than two groups, because running many pairwise t-tests inflates the overall Type I error rate (the chance of a false positive).

If you want to compare the means of several groups at once, use ANOVA instead, followed by a post-hoc test to see which specific groups differ.

A one-sample t-test is used to compare a single population to a standard value (for example, to determine whether the average lifespan of people in a specific town is different from the country average).

A paired t-test is used to compare a single population before and after some experimental intervention or at two different points in time (for example, measuring student performance on a test before and after being taught the material).

A t-test measures the difference in group means divided by the pooled standard error of the two group means.

In this way, it calculates a number (the t-value) illustrating the magnitude of the difference between the two group means being compared, and the associated p-value estimates how likely it is that a difference of this size would arise by chance alone.

Your choice of t-test depends on whether you are studying one group or two groups, and whether you care about the direction of the difference in group means.

If you are studying one group, use a paired t-test to compare the group mean over time or after an intervention, or use a one-sample t-test to compare the group mean to a standard value. If you are studying two groups, use a two-sample t-test .

If you want to know only whether a difference exists, use a two-tailed test . If you want to know whether one group mean is greater or less than the other, use a one-tailed test (left-tailed or right-tailed, respectively).

A t-test is a statistical test that compares the means of two samples . It is used in hypothesis testing , with a null hypothesis that the difference in group means is zero and an alternate hypothesis that the difference in group means is different from zero.
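
The Python sketch below (all numbers invented) shows the scipy calls that correspond to these choices: a one-sample t-test against a standard value, a paired t-test on the same group before and after an intervention, and a two-sample t-test on two independent groups.

```python
from scipy import stats

before = [72, 75, 68, 80, 77, 74]          # hypothetical scores before an intervention
after = [78, 79, 70, 85, 81, 76]           # the same group afterwards
group_a = [5.1, 4.8, 5.5, 5.0, 4.9]        # hypothetical measurements, group A
group_b = [5.9, 6.1, 5.7, 6.3, 5.8]        # hypothetical measurements, group B

# One-sample t-test: compare one group's mean to a standard value (here, 70)
print(stats.ttest_1samp(before, popmean=70))

# Paired t-test: the same group before and after an intervention
print(stats.ttest_rel(before, after))

# Two-sample t-test: two independent groups (two-tailed by default;
# pass alternative="less" or "greater" for a one-tailed test)
print(stats.ttest_ind(group_a, group_b))
```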

Statistical significance is a term used by researchers to state that it is unlikely their observations could have occurred under the null hypothesis of a statistical test . Significance is usually denoted by a p -value , or probability value.

Statistical significance is arbitrary – it depends on the threshold, or alpha value, chosen by the researcher. The most common threshold is p < 0.05, which means that data at least as extreme as those observed would be expected to occur less than 5% of the time under the null hypothesis .

When the p -value falls below the chosen alpha value, then we say the result of the test is statistically significant.

A test statistic is a number calculated by a statistical test . It describes how far your observed data is from the null hypothesis of no relationship between variables or no difference among sample groups.

The test statistic tells you how different two or more groups are from the overall population mean , or how different a linear slope is from the slope predicted by a null hypothesis . Different test statistics are used in different statistical tests.

Statistical tests commonly assume that:

  • the data are normally distributed
  • the groups that are being compared have similar variance
  • the data are independent

If your data do not meet these assumptions, you might still be able to use a nonparametric statistical test , which has fewer requirements but also makes weaker inferences.



Open access · Published: 01 September 2024

Robust identification of perturbed cell types in single-cell RNA-seq data

  • Phillip B. Nicol 1 ,
  • Danielle Paulson 1 ,
  • Gege Qian 2 ,
  • X. Shirley Liu   ORCID: orcid.org/0000-0003-4736-7339 3 ,
  • Rafael Irizarry   ORCID: orcid.org/0000-0002-3944-4309 3 &
  • Avinash D. Sahu   ORCID: orcid.org/0000-0002-2193-6276 4  

Nature Communications, volume 15, Article number: 7610 (2024)


  • Bioinformatics
  • Computational models
  • Data mining
  • Statistical methods
  • Transcriptomics

Single-cell transcriptomics has emerged as a powerful tool for understanding how different cells contribute to disease progression by identifying cell types that change across diseases or conditions. However, detecting changing cell types is challenging due to individual-to-individual and cohort-to-cohort variability and naive approaches based on current computational tools lead to false positive findings. To address this, we propose a computational tool, scDist , based on a mixed-effects model that provides a statistically rigorous and computationally efficient approach for detecting transcriptomic differences. By accurately recapitulating known immune cell relationships and mitigating false positives induced by individual and cohort variation, we demonstrate that scDist outperforms current methods in both simulated and real datasets, even with limited sample sizes. Through the analysis of COVID-19 and immunotherapy datasets, scDist uncovers transcriptomic perturbations in dendritic cells, plasmacytoid dendritic cells, and FCER1G+NK cells, that provide new insights into disease mechanisms and treatment responses. As single-cell datasets continue to expand, our faster and statistically rigorous method offers a robust and versatile tool for a wide range of research and clinical applications, enabling the investigation of cellular perturbations with implications for human health and disease.


Introduction

The advent of single-cell technologies has enabled measuring transcriptomic profiles at single-cell resolution, paving the way for the identification of subsets of cells with transcriptomic profiles that differ across conditions. These cutting-edge technologies empower researchers and clinicians to study human cell types impacted by drug treatments, infections like SARS-CoV-2, or diseases like cancer. To conduct such studies, scientists must compare single-cell RNA-seq (scRNA-seq) data between two or more groups or conditions, such as infected versus non-infected 1 , responders versus non-responders to treatment 2 , or treatment versus control in controlled experiments.

Two related but distinct classes of approaches exist for comparing conditions in single-cell data: differential abundance prediction and differential state analysis 3 . Differential abundance approaches, such as DA-seq, Milo, and Meld 4 , 5 , 6 , 7 , focus on identifying cell types with varying proportions between conditions. In contrast, differential state analysis seeks to detect predefined cell types with distinct transcriptomic profiles between conditions. In this study, we focus on the problem of differential state analysis.

Past differential state studies have relied on manual approaches involving visually inspecting data summaries to detect differences in scRNA data. Specifically, cells were clustered based on gene expression data and visualized using uniform manifold approximation (UMAP) 8 . Cell types that appeared separated between the two conditions were identified as different 1 . Another common approach is to use the number of differentially expressed genes (DEGs) as a metric for transcriptomic perturbation. However, as noted by ref. 9 , the number of DEGs depends on the chosen significance level and can be confounded by the number of cells per cell type because this influences the power of the corresponding statistical test. Additionally, this approach does not distinguish between genes with large and small (yet significant) effect sizes.

To overcome these limitations, Augur 9 uses a machine learning approach to quantify the cell-type specific separation between the two conditions. Specifically, Augur trains a classifier to predict condition labels from the expression data and then uses the area under the receiver operating characteristic (AUC) as a metric to rank cell types by their condition difference. However, Augur does not account for individual-to-individual variability (or pseudoreplication 10 ), which we show can confound the rankings of perturbed cell types.

In this study, we develop a statistical approach that quantifies transcriptomic shifts by estimating the distance (in gene expression space) between the condition means. This method, which we call scDist , introduces an interpretable metric for comparing different cell types while accounting for individual-to-individual and technical variability in scRNA-seq data using linear mixed-effect models. Furthermore, because transcriptomic profiles are high-dimensional, we develop an approximation for the between-group differences, based on a low-dimensional embedding, which results in a computationally convenient implementation that is substantially faster than Augur . We demonstrate the benefits using a COVID-19 dataset, showing that scDist can recover biologically relevant between-group differences while also controlling for sample-level variability. Furthermore, we demonstrated the utility of the scDist by jointly inferring information from five single-cell immunotherapy cohorts, revealing significant differences in a subpopulation of NK cells between immunotherapy responders and non-responders, which we validated in bulk transcriptomes from 789 patients. These results highlight the importance of accounting for individual-to-individual and technical variability for robust inference from single-cell data.

Not accounting for individual-to-individual variability leads to false positives

We used blood scRNA-seq from six healthy controls 1 (see Table 1 ), and randomly divided them into two groups of three, generating a negative control dataset in which no cell type should be detected as being different. We then applied Augur to these data. This procedure was repeated 20 times. Augur falsely identified several cell types as perturbed (Fig. 1 A). Augur quantifies differences between conditions with an AUC summary statistic, related to the amount of transcriptional separation between the two groups (AUC = 0.5 represents no difference). Across the 20 negative control repeats, 93% of the AUCs (across all cell types) were >0.5, and red blood cells (RBCs) were identified as perturbed in all 20 trials (Fig. 1 A). This false positive result was in part due to high across-individual variability in cell types such as RBCs (Fig. 1 B).

Figure 1

A AUCs achieved by Augur on 20 random partitions of healthy controls ( n  = 6 total patients divided randomly into two groups of 3), with no expected cell type differences (dashed red line indicates the null value of 0.5). B Boxplot depicting the second PC score for red blood cells from healthy individuals, highlighting high across-individual variability (each box represents a different individual). The boxplots display the median and first/third quartiles. C AUCs achieved by Augur on simulated scRNA-seq data (10 individuals, 50 cells per individual) with no condition differences but varying patient-level variability (dashed red line indicates the ground truth value of no condition difference, AUC 0.5), illustrating the influence of individual-to-individual variability on false positive predictions. Points and error bands represent the mean  ±1 SD. Source data are provided as a Source Data file.

We confirmed that individual-to-individual variation underlies false positive predictions made by Augur using a simulation. We generated simulated scRNA-seq data with no condition-level difference and varying patient-level variability (Methods). As patient-level variability increased, differences estimated by Augur also increased, converging to the maximum possible AUC of 1 (Fig.  1 C): Augur falsely interpreted individual-to-individual variability as differences between conditions.

Augur recommends that unwanted variability should be removed in a pre-processing step using batch correction software. We applied Harmony 11 to the same dataset 1 , treating each patient as a batch. We then applied Augur to the resulting batch corrected PC scores and found that several cell types still had AUCs significantly above the null value of 0.5 (Fig.  S1 a). On simulated data, batch correction as a pre-processing step also leads to confounding individual-to-individual variability as condition difference (Fig.  S1 b).

A model-based distance metric controls for false positives

To account for individual-to-individual variability, we modeled the vector of normalized counts with a linear mixed-effects model. Mixed models have previously been shown to be successful at adjusting for this source of variability 10 . Specifically, for a given cell type, let z i j be a length G vector of normalized counts for cell i and sample j ( G is the number of genes). We then model

\[ {\bf{z}}_{ij} = {\boldsymbol{\alpha}} + {\boldsymbol{\beta}}\, x_j + {\boldsymbol{\omega}}_j + {\boldsymbol{\varepsilon}}_{ij}, \]

where α is a vector with entries α g representing the baseline expression for gene g , x j is a binary indicator that is 0 if individual j is in the reference condition, and 1 if in the alternative condition, β is a vector with entries β g representing the difference between condition means for gene g , ω j is a random effect that represents the differences between individuals, and ε i j is a random vector (of length G ) that accounts for other sources of variability. We assume that \({\boldsymbol{\omega}}_j \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \tau^2 I)\), \({\boldsymbol{\varepsilon}}_{ij} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2 I)\), and that the ω j and ε i j are independent of each other.

To obtain normalized counts, we recommend defining z i j to be the vector of Pearson residuals obtained from fitting a Poisson or negative binomial GLM 12 ; this normalization procedure is implemented in the scTransform function 13 . However, our proposed approach can be used with other normalization methods for which the model is appropriate.

Note that in model ( 18 ), the means for the two conditions are α and α + β , respectively. Therefore, we quantify the difference in expression profile by taking the 2-norm of the vector β :

\[ D = \|{\boldsymbol{\beta}}\|_2 = \sqrt{\textstyle\sum_{g=1}^{G} \beta_g^2}. \]

Here, D can be interpreted as the Euclidean distance between condition means (Fig. 2 A).

Figure 2

A scDist estimates the distance between condition means in high-dimensional gene expression space for each cell type. B To improve efficiency, scDist calculates the distance in a low-dimensional embedding space (derived from PCA) and employs a linear mixed-effects model to account for sample-level and other technical variability. This figure was created with BioRender.com and released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license.

Because we expected the vector of condition differences β to be sparse, we improved computational efficiency by approximating D with a singular value decomposition: we find a K × G matrix U , with K much smaller than G , such that

\[ D_K = \sqrt{\textstyle\sum_{k=1}^{K} (U{\boldsymbol{\beta}})_k^2} \approx D. \]

With this approximation in place, we fitted model equation ( 18 ) by replacing z i j with U z i j to obtain estimates of ( U β ) k . A challenge with estimating D K is that the maximum likelihood estimator can have a significant upward bias when the number of patients is small (as is typically the case). For this reason, we employed a post-hoc Bayesian procedure to shrink the \((U{\boldsymbol{\beta}})_k^2\) towards zero and compute a posterior distribution of D K 14 . We also provided a statistical test for the null hypothesis that D K = 0. We refer to the resulting procedure as scDist (Fig. 2 B). Technical details are provided in Methods.

We applied scDist to the negative control dataset based on blood scRNA-seq from six healthy controls that was used to show the large number of false positives reported by Augur (Fig. 1 ) and found that the false positive rate was controlled (Fig. 3 A, B). We then applied scDist to the data from the simulation study and found that, unlike Augur , the resulting distance estimate does not grow with individual-to-individual variability (Fig. 3 C). scDist also accurately estimated distances on fully simulated data (Fig. S2 ).

Figure 3

A Reanalysis of the data from Fig.  1 A using distances calculated with scDist ; the dashed red line represents ground truth value of 0. B As in A , but for p-values computed with scDist ; values above the dashed red line represent p  < 0.05. C Null simulation from Fig.  1 C reanalyzed using distances calculated by scDist ; the dashed red line represents the ground truth value of 0. Points and error bands represent the mean  ±1 SD. D Dendrogram generated by hierarchical clustering based on distances between pairs of cell types estimated by scDist . In all panels the boxplots display median and first/third quartiles. Source data are provided as a Source Data file.

The Euclidean distance D measures perturbation by taking the sum of squared differences across all genes. To show that this measure is biologically meaningful, we applied scDist to obtain estimated distances between pairs of known cell types in the above dataset and then applied hierarchical clustering to these distances. The resulting clustering is consistent with known relationships driven by cell lineages (Fig.  3 D). Specifically, Lymphoid cell types T and NK cells clustered together, while B cells were further apart, and Myeloid cell types DC, monocytes, and neutrophils were close to each other.

Though the scDist distance D assigns each gene an equal weight (unweighted), scDist includes an option to assign different weights w g to each gene (Methods). Weighting could be useful in situations where certain genes are known to contribute more to specific phenotypes. We conducted a simulation to study the impact of using the weighted distance. These simulations show that when a priori information is available, using the correct weighting leads to a slightly better estimation of the distance. However, incorrect weighting leads to significantly worse estimation compared to the unweighted distance (Fig.  S3 ). Therefore, the unweighted distance is recommended unless strong a priori information is available.

Challenges in cell type annotation are expected to impact scDist ’s interpretation, much as they do for other methods that rely on a priori cell type annotation 3 , 9 . Our simulations (see Methods) reveal scDist ’s vulnerability to false negatives when annotations are confounded by condition- or patient-specific factors. However, when clusters are annotated using data from which such differences have been removed, scDist ’s predictions become more reliable (Fig. S23 ). Thus, we recommend removing these confounders before annotation. As potential issues could occur when the inter-condition distance exceeds the inter-cell-type distance, scDist provides a diagnostic plot (Fig. S6 ) to compare these two distances. scDist also incorporates an additional diagnostic feature (Fig. S24 ) to identify annotation issues, utilizing a cell-type tree to evaluate cell relationships at different hierarchical levels. Inconsistencies in scDist ’s output signal potential clustering or annotation errors.

Comparison to counting the number of DEGs

We also compared scDist to the approach of counting the number of differentially expressed genes (nDEG) on pseudobulk samples 3 . Given that the statistical power to detect DEGs is heavily reliant on sample size, we hypothesized that nDEG could become a misleading measure of perturbation in single-cell data with a large variance in the number of cells per cell type. To demonstrate this, we applied both methods to resampled COVID-19 data 1 where the number of cells per cell type was artificially varied between 100 and 10,000. nDEG was highly confounded by the number of cells (Fig.  4 A), whereas the scDist distance remained relatively constant despite the varying number of cells (Fig.  4 B). When the number of subsampled cells is small, the ranking of cell types (by perturbation) was preserved by scDist but not by nDEG (Fig.  S5 a–c). Additionally, scDist was over 60 times faster than nDEG since the latter requires testing all G genes as opposed to K   ≪   G PCs (Fig.  S4 ).

Figure 4

A Sampling with replacement from the COVID-19 dataset 1 to create datasets with a fixed number of cells per cell type, and then counting the number of differentially expressed genes (nDEG) for the CD14 monocytes. B Repeating the previous analysis with the scDist distance. C Comparing all pairs of cell types on the downsampled 15 dataset and applying hierarchical clustering to the pairwise perturbations. Leaves corresponding to T cells are colored blue while the leaf corresponding to monocytes is colored red. D The same analysis using the scDist distances. Source data are provided as a Source Data file.

An additional limitation of nDEG is that it does not account for the magnitude of the differential expression. We illustrated this with a simple simulation that shows the number of DEGs between two cell types can be the same (or less) despite a larger transcriptomic perturbation in gene expression space (Fig.  S7 a, b). To demonstrate this on real data, we considered a dataset consisting of eight sorted immune cell types (originally from ref. 15 and combined by ref. 16 ) where scDist and nDEG were applied to all pairs of cell types, and the perturbation estimates were visualized using hierarchical clustering. Although both nDEG and scDist performed well when the sample size was balanced across cell types (Fig.  S8 ), nDEG provided inconsistent results when the CD14 Monocytes were downsampled to create a heterogeneous cell type size distribution. Specifically, scDist produced the expected result of clustering the T cells together, whereas nDEG places the Monocytes in the same cluster as B and T cells (Fig.  4 C, D) despite the fact that these belong to different lineages. Thus by taking into account the magnitude of the differential expression, scDist is able to produce results more in line with known biology.

We also considered varying the number of patients on simulated data with a known ground truth. Again, the nDEG (computed using a mixed model, as recommended by ref. 10 ) increases as the number of patients increases, whereas scDist remains relatively stable (Fig.  S9 a). Moreover, the correlation between the ground truth perturbation and scDist increases as the number of patients increases (Fig.  S9 b). Augur was also sensitive to the number of samples and had a lower correlation with the ground truth than both nDEG and scDist .

scDist detects cell types that are different in COVID-19 patient compared to controls

We applied scDist to a large COVID-19 dataset 17 consisting of 1.4 million cells of 64 types from 284 PBMC samples from 196 individuals (171 COVID-19 patients and 25 healthy donors). The large number of samples in this dataset permitted further evaluation of our approach using real data rather than simulations. Specifically, we defined true distances between the two groups by computing the sum of squared log fold changes (across all genes) on the entire dataset and then estimated the distance on random samples of five cases versus five controls. Because Augur does not estimate distances explicitly, we assessed the two methods’ ability to accurately recapitulate the ranking of cell types based on the established ground truth distances. We found that scDist recovers the rankings better than Augur (Fig. 5 A, S10 ). When the size of the subsample is increased to 15 patients per condition, the accuracy of scDist in recovering the ground truth rank and distance improves further (Fig. S25 ).

Figure 5

A Correlation between estimated ranks (based on subsamples of 5 cases and 5 controls) and true ranks for each method; points above the diagonal line indicate better agreement of scDist with the true ranking. B Plot of true distance vs. distances estimated with scDist (dashed line represents y  =  x ). Points and error bars represent the mean and 5th/95th percentiles. C AUC values achieved by Augur , where color represents likely true (blue) or false (orange) positive cell types. D Same as C , but for distances estimated with scDist . E AUC values achieved by Augur against the cell number variation in subsampled datasets (of false positive cell types). F Same as E , but for distances estimated with scDist . In all boxplots the median and first/third quartiles are reported. For C – E , 1000 random subsamples were used. Source data are provided as a Source Data file.

To evaluate scDist ’s accuracy further, we defined a new ground truth using the entire COVID-19 dataset, consisting of two groups: four cell types with differences between groups (true positives) and five cell types without differences (false positives) (Fig. S11 , Methods). We generated 1000 random samples with only five individuals per cell type and estimated group differences using both Augur and scDist . Augur failed to accurately separate the two groups (Fig. 5 C); the median difference estimates of all true positive cell types, except MKI67+ CD8+ T cells, were lower than the median estimates of all true negative cell types (Fig. 5 C). In contrast, scDist clearly separated the estimates for the two groups (Fig. 5 D).

Single-cell data can also exhibit dramatic sample-specific variation in the number of cells of specific cell types. This imbalance can arise from differences in collection strategies, biospecimen quality, and technical effects, and can impact the reliability of methods that do not account for sample-to-sample or individual-to-individual variation. We measured the variation in cell numbers within samples by calculating the ratio of the largest sample’s cell count to the total cell counts across all samples (Methods). Augur ’s predictions were negatively impacted by this cell number variation (Figs.  5 E, S12 ), indicating its increased susceptibility to false positives when sample-specific cell number variation was present (Fig.  1 C). In contrast, scDist ’s estimates were robust to sample-specific cell number variation in single-cell data (Fig.  5 F).

To further demonstrate the advantage of statistical inference in the presence of individual-to-individual variation, we analyzed the smaller COVID-19 dataset 1 with only 13 samples. The original study 1 discovered differences between cases and controls in CD14+ monocytes through extensive manual inspection. scDist identified this same group as the most significantly perturbed cell type. scDist also identified two cell types not considered in the original study, dendritic cells (DCs) and plasmacytoid dendritic cells (pDCs) ( p = 0.01 and p = 0.04, Fig. S13 a), although pDCs did not remain significant after adjusting for multiple testing. We note that DCs induce anti-viral innate and adaptive responses through antigen presentation 18 . Our finding was consistent with studies reporting that DCs and pDCs are perturbed by COVID-19 infection 19 , 20 . In contrast, Augur identified RBCs, not CD14+ monocytes, as the most perturbed cell type (Fig. S14 ). Omitting the patient with the most RBCs markedly reduced the RBC perturbation between infected and control cases estimated by Augur (Fig. S14 ), further suggesting that Augur ’s predictions are clouded by patient-level variability.

scDist enables the identification of genes underlying cell-specific across-condition differences

To identify transcriptomic alterations, scDist assigns an importance score to each gene based on its contribution to the overall perturbation (Methods). We assessed this importance score for CD14+ monocytes in the smaller COVID-19 dataset. In this cell type, scDist assigned the highest importance scores to the genes S100 calcium-binding protein A8 ( S100A8 ) and S100 calcium-binding protein A9 ( S100A9 ) ( p < 10 −3 , Fig. S13 b). These genes are canonical markers of inflammation 21 that are upregulated during cytokine storms. Since patients with severe COVID-19 infections often experience cytokine storms, this result suggests that S100A8/A9 upregulation in CD14+ monocytes could be a marker of the cytokine storm 22 . These two genes were also reported to be upregulated in COVID-19 patients in the study of 284 samples 17 .

scDist identifies transcriptomic alterations associated with immunotherapy response

To demonstrate the real-world impact of scDist , we applied it to four published datasets used to understand patient responses to cancer immunotherapy in head and neck, bladder, and skin cancer patients 2 , 23 , 24 , 25 . We found that each individual dataset was underpowered to detect differences between responders and non-responders (Fig. S15 ). To potentially increase power, we combined the data from all cohorts (Fig. 6 A). However, we found that analyzing the combined data without accounting for cohort-specific variations led to false positives. For example, responder–non-responder differences estimated by Augur were highly correlated between pre- and post-treatment samples (Fig. 6 B), suggesting a confounding effect of cohort-specific variations. Furthermore, Augur predicted that most cell types were altered in both pre-treatment and post-treatment samples (AUC > 0.5 for 41 of 49 cell types pre-treatment and 44 of 49 post-treatment), which is potentially due to the confounding effect of cohort-specific variations.

Figure 6

A Study design: discovery cohorts of four scRNA cohorts (cited in order as shown 2 , 23 , 24 , 25 ) identify cell-type-specific differences and a differential gene signature between responders and non-responders. This signature was evaluated in validation cohorts of six bulk RNA-seq cohorts (cited in order as shown 32 , 33 , 34 , 35 , 36 , 37 , 38 ). B Pre-treatment and post-treatment sample differences were estimated using Augur and scDist (Spearman correlation is reported on the plot). The error bars represent the 95% confidence interval for the fitted linear regression line. C Significance of the estimated differences ( scDist ). D Kaplan–Meier plots display the survival differences in anti-PD-1 therapy patients, categorized into low-risk and high-risk groups using the median value of the NK-2 signature; overall and progression-free survival are shown. E NK-2 signature levels in non-responders, partial responders, and responders (bulk RNA-seq cohorts). The boxplots report the median and first/third quartiles. Panel A was created with BioRender.com and released under a Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International license. Source data are provided as a Source Data file.

To account for cohort-specific variations, we ran scDist with an additional explanatory variable in model ( 18 ) to capture cohort effects. With this approach, distance estimates were no longer significantly correlated between pre- and post-treatment (Fig. 6 B). Removing these variables re-established the correlation (Fig. S16 ). scDist predicted that CD4-T and CD8-T cells were altered pre-treatment (Fig. S17 a), while NK, CD8-T, and B cells were altered post-treatment (Fig. S17 b). Analysis of subtypes revealed that FCER1G+NK cells (NK-2) were changed in both pre-treatment and post-treatment samples (Fig. 6 C). To validate this finding, we generated an NK-2 signature of genes differential between responders and non-responders (Fig. S18 , Methods) and evaluated this signature in bulk RNA-seq immunotherapy cohorts comprising 789 patient samples (Fig. 6 A). We scored each of the 789 patient samples using the NK-2 differential signature (Methods). The NK-2 signature scores were significantly associated with overall and progression-free survival (Fig. 6 D) as well as radiology-based response (Fig. 6 E). We similarly evaluated the top Augur prediction: the differential signature from plasma cells, the top cell type predicted by Augur , did not show an association with response or survival outcomes in the 789 bulk transcriptomes (Fig. S19 , Methods).

scDist is computationally efficient

A key strength of the linear modeling framework used by scDist is that it is efficient on large datasets. For instance, on the COVID-19 dataset with 13 samples 1 , scDist completed the analysis in around 50 seconds, while Augur required 5 minutes. To better understand how runtime depends on the number of cells, we applied both methods to subsamples of the dataset that varied in size and observed that scDist was, on average, five-fold faster (Fig.  S20 ). scDist is also capable of scaling to millions of cells. On simulated data, scDist required approximately 10 minutes to fit a dataset with 1,000,000 cells (Fig.  S21 ). We also tested the sensitivity of scDist to the number of PCs used by comparing D K for various values of K . We observed that the estimated distances stabilize as K increases (Fig.  S22 ), justifying K  = 20 as a reasonable choice for most datasets.

The identification of cell types influenced by infections, treatments, or biological conditions is crucial for understanding their impact on human health and disease. We present scDist , a statistically rigorous and computationally fast method for detecting cell-type specific differences across multiple groups or conditions. By using a mixed-effects model, scDist estimates the difference between groups while quantifying the statistical uncertainty due to individual-to-individual variation and other sources of variability. We validated scDist through the unbiased recapitulation of known relationships between immune cells and demonstrated its effectiveness in mitigating false positives from patient-level and technical variations in both simulated and real datasets. Notably, scDist facilitates biological discoveries from scRNA cohorts, even when the number of individuals is limited, a common occurrence in human scRNA-seq datasets. We also pointed out how the detection of cell-type specific differences can be obscured by batch effects or other confounders and how the linear model used by our approach permits accounting for these.

Since the same expression data is used for annotation and by scDist , there are potential issues associated with “double dipping.” Our simulation highlighted this issue by showing that condition-specific effects can result in over-clustering and downward bias in the estimated distances (Methods, Fig. S23 ). Users can avoid these false negatives by using annotation approaches that can control for patient- and condition-specific effects. scDist provides two diagnostic tools to help users identify potential issues in their annotation (Figs. S24 and S6 ). Despite this, significant errors in clustering and annotation could cause unavoidable bias in scDist , and thus designing a cluster-free extension of scDist is an area for future work. scDist also provides a diagnostic tool that estimates distances at multiple resolutions to help users identify potential issues in their annotation (Fig. S24 ). Another point of sensitivity for scDist is the choice of the number of principal components used to estimate the distance. Although in practice we observed that the distance estimate is stable as the number of PCs varies between 20 and 50 (Fig. S22 ), an adaptive approach for selecting K could improve performance and maximize power. Finally, although Pearson residual-based normalized counts 12 , 13 are the recommended input for scDist , if the available data were normalized by another, sub-optimal approach, scDist ’s performance could be affected. A future version could adapt the model and estimation procedure so that scDist can be applied directly to the counts, avoiding potential problems introduced by normalization.

We believe that scDist will have extensive utility, as the comparison of single-cell experiments between groups is a common task across a range of research and clinical applications. In this study, we have focused on examining discrete phenotypes, such as infected versus non-infected (in COVID-19 studies) and responders vs. non-responders to checkpoint inhibitors. However, the versatility of our framework allows for extension to experiments involving continuous phenotypes or conditions, such as height, survival, and exposure levels, to name a few. As single-cell datasets continue to grow in size and complexity, scDist will enable rigorous and reliable insights into cellular perturbations with implications for human health and disease.

Normalization

Our method takes as input a normalized count matrix (with corresponding cell type annotations). We recommend using scTransform 13 to normalize, although the method is compatible with any normalization approach. Let y i j g be the UMI counts for gene 1 ≤  g  ≤  G in cell i from sample j . scTransform fits the following model:

where r i j is the total number of UMI counts for the particular cell. The normalized counts are given by the Pearson residuals of the above model:

Distance in normalized expression space

In this section, we describe the inferential procedure of scDist for cases without additional covariates. However, the procedure can be generalized to the full model ( 18 ) with arbitrary covariates (design matrix) incorporating random and fixed effects, as well as nested-effect mixed models. For a given cell type, we model the G -dimensional vector of normalized counts as

\[ {\bf{z}}_{ij} = {\boldsymbol{\alpha}} + {\boldsymbol{\beta}}\, x_{ij} + {\boldsymbol{\omega}}_j + {\boldsymbol{\varepsilon}}_{ij}, \]

where \({\boldsymbol{\alpha}}, {\boldsymbol{\beta}} \in {\mathbb{R}}^{G}\), x i j is a binary indicator of condition, \({\boldsymbol{\omega}}_j \sim {\mathcal{N}}(0, \tau^2 I_G)\), and \({\boldsymbol{\varepsilon}}_{ij} \sim {\mathcal{N}}(0, \sigma^2 I_G)\). The quantity of interest is the Euclidean distance between the condition means α and α + β :

\[ D = \|{\boldsymbol{\beta}}\|_2 = \sqrt{\textstyle\sum_{g=1}^{G} \beta_g^2}. \]

If \(U \in {\mathbb{R}}^{G \times G}\) is an orthonormal matrix, we can apply U to equation ( 6 ) to obtain the transformed model:

\[ U{\bf{z}}_{ij} = U{\boldsymbol{\alpha}} + U{\boldsymbol{\beta}}\, x_{ij} + U{\boldsymbol{\omega}}_j + U{\boldsymbol{\varepsilon}}_{ij}. \]

Since U is orthogonal, U ω j and U ε i j still have spherical normal distributions. We also have that

\[ \|U{\boldsymbol{\beta}}\|_2 = \|{\boldsymbol{\beta}}\|_2 = D. \]

This means that the distance in the transformed model is the same as in the original model. As mentioned earlier, our goal is to find U such that

\[ D_K = \sqrt{\textstyle\sum_{k=1}^{K} (U{\boldsymbol{\beta}})_k^2} \approx D \]

with K ≪ G .

Let \(Z \in {\mathbb{R}}^{n \times G}\) be the matrix with rows z i j (where n is the total number of cells). Intuitively, we want to choose a U such that the projection of z i j onto the first K rows of U ( \(u_1, \ldots, u_K \in {\mathbb{R}}^{G}\) ) minimizes the reconstruction error

\[ \sum_{i,j} \Big\| {\bf{z}}_{ij} - {\boldsymbol{\mu}} - \sum_{k=1}^{K} v_{ik}\, u_k \Big\|_2^2, \]

where \({\boldsymbol{\mu}} \in {\mathbb{R}}^{G}\) is a shift vector and \((v_{ik}) \in {\mathbb{R}}^{n \times K}\) is a matrix of coefficients. It can be shown that the PCA of Z yields the (orthonormal) u 1 , …, u K that minimize this reconstruction error 26 .

Given an estimator \(\widehat{(U{\boldsymbol{\beta}})}_k\) of ( U β ) k , a naive estimator of D K is given by taking the square root of the sum of squared estimates:

\[ \widehat{D}_K = \sqrt{\textstyle\sum_{k=1}^{K} \widehat{(U{\boldsymbol{\beta}})}_k^{\,2}}. \]

However, this estimator can have significant upward bias due to sampling variability. For instance, even if the true distance is 0, \(\widehat{(U{\boldsymbol{\beta}})}_k\) is unlikely to be exactly zero, and that noise becomes strictly positive when squaring.

To account for this, we apply a post-hoc Bayesian procedure to the \(\widehat{(U{\boldsymbol{\beta}})}_k\) to shrink them towards zero before computing the sum of squares. In particular, we adopt the spike-and-slab model of ref. 14 :

\[ \widehat{(U{\boldsymbol{\beta}})}_k \mid (U{\boldsymbol{\beta}})_k \sim {\mathcal{N}}\big((U{\boldsymbol{\beta}})_k,\; {\rm{Var}}[\widehat{(U{\boldsymbol{\beta}})}_k]\big), \qquad (U{\boldsymbol{\beta}})_k \sim \pi_0\,\delta_0 + \sum_{t=1}^{T} \pi_t\, {\mathcal{N}}(0, \sigma_t^2), \]

where \({\rm{Var}}[\widehat{(U{\boldsymbol{\beta}})}_k]\) is the variance of the estimator \(\widehat{(U{\boldsymbol{\beta}})}_k\), δ 0 is a point mass at 0, and π 0 , π 1 , … π T are mixing weights (that is, they are non-negative and sum to 1). Ref. 14 provides a fast empirical Bayes approach to estimate the mixing weights and obtain posterior samples of ( U β ) k . Samples from the posterior of D K are then obtained by applying formula ( 12 ) to the posterior samples of ( U β ) k . We then summarize the posterior distribution by reporting the median and other quantiles. An advantage of this particular specification is that the amount of shrinkage depends on the uncertainty in the initial estimate of ( U β ) k .

We use the following procedure to obtain \(\widehat{(U{\boldsymbol{\beta}})}_k\) :

  • Use the matrix of PCA loadings as a plug-in estimator for U . Then U z i j is the vector of PC scores for cell i in sample j .
  • Estimate ( U β ) k by using lme4 27 to fit the model ( 6 ) using the PC scores corresponding to the k -th loading (i.e., each dimension is fit independently).

Note that only the first K rows of U need to be stored.
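
For illustration only, the Python sketch below mimics the general recipe just described under simplifying assumptions; it is not the authors' implementation (which uses lme4 in R) and it omits the shrinkage and testing steps. It projects the normalized counts onto K principal components, fits a random-intercept mixed model per component with statsmodels, and combines the estimated condition effects into a naive distance estimate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.decomposition import PCA

def naive_distance(Z, condition, sample, K=20):
    """Illustrative sketch: estimate the between-condition distance for one cell type.

    Z         : (cells x genes) array of normalized counts for a single cell type
    condition : length-n array of 0/1 condition labels per cell
    sample    : length-n array of sample/patient identifiers per cell
    """
    # Project onto the first K principal components (plug-in estimate of U)
    scores = PCA(n_components=K).fit_transform(Z)

    effects = []
    for k in range(K):
        df = pd.DataFrame({
            "pc": scores[:, k],
            "condition": condition,
            "sample": sample,
        })
        # Random intercept per sample accounts for individual-to-individual variability
        fit = smf.mixedlm("pc ~ condition", df, groups=df["sample"]).fit()
        effects.append(fit.fe_params["condition"])

    # Naive estimate: square root of the sum of squared condition effects
    # (scDist additionally shrinks these estimates before summing)
    return np.sqrt(np.sum(np.square(effects)))
```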

We are particularly interested in testing the null hypothesis of D K = 0 against the alternative D K > 0. Because the null hypothesis corresponds to ( U β ) k = 0 for all 1 ≤ k ≤ K , we can use the sum of the individual Wald statistics as our test statistic:

\[ W = \sum_{k=1}^{K} W_k, \qquad W_k = \frac{\widehat{(U{\boldsymbol{\beta}})}_k^{\,2}}{{\rm{Var}}[\widehat{(U{\boldsymbol{\beta}})}_k]}. \]

Under the null hypothesis that ( U β ) k = 0, W k can be approximated by an F distribution with 1 and ν k degrees of freedom, where ν k is estimated using Satterthwaite’s approximation in lmerTest . This implies that

\[ W \sim \sum_{k=1}^{K} F_{1,\nu_k} \]

under the null. Moreover, the W k are independent because we have assumed that the covariance matrices for the sample- and cell-level noise are multiples of the identity. Equation ( 16 ) is not a known distribution, but its quantiles can be approximated using Monte Carlo samples. To make this precise, let W 1 , …, W M be draws from equation ( 16 ), where \(M = 10^5\), and let W * be the value of equation ( 15 ) (i.e., the actual test statistic). Then the empirical p -value 28 is computed as

\[ p = \frac{1 + \#\{m : W_m \ge W^{*}\}}{M + 1}. \]
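
Under the assumptions above, the Monte Carlo step can be sketched in a few lines of Python; the degrees of freedom and observed statistic below are placeholders standing in for values that would come from the fitted mixed models.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder values: Satterthwaite degrees of freedom for each of K components
# and the observed test statistic W* (these would come from the fitted models)
nu = np.array([8.2, 7.5, 9.1, 6.8, 8.9])
w_obs = 14.3

M = 100_000
# Each W_k is approximately F-distributed with (1, nu_k) degrees of freedom;
# draw M independent sums of the K components to approximate the null of W
draws = stats.f.rvs(dfn=1, dfd=nu, size=(M, len(nu)), random_state=rng).sum(axis=1)

# Empirical p-value with the usual add-one correction
p_value = (1 + np.sum(draws >= w_obs)) / (M + 1)
print(p_value)
```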

Controlling for additional covariates

Because scDist is based on a linear model, it is straightforward to control for additional covariates, such as the age or sex of a patient, in the analysis. In particular, model ( 18 ) can be replaced with

\[ {\bf{z}}_{ij} = {\boldsymbol{\alpha}} + {\boldsymbol{\beta}}\, x_{ij} + \sum_{k} {\boldsymbol{\gamma}}_k\, w_{ijk} + {\boldsymbol{\omega}}_j + {\boldsymbol{\varepsilon}}_{ij}, \]

where \(w_{ijk} \in {\mathbb{R}}\) is the value of the k th covariate for cell i in sample j and \({\boldsymbol{\gamma}}_k \in {\mathbb{R}}^{G}\) is the corresponding gene-specific effect for the k th covariate.

Choosing the number of principal components

An important choice in scDist is the number of principal components d . If d is chosen too small, then estimation accuracy may suffer because the first few PCs may not capture enough of the distance. On the other hand, if d is chosen too large, then power may suffer because the majority of the PCs will simply capture random noise (while adding degrees of freedom to the Wald statistic). Moreover, it is important that d is chosen a priori, as choosing the d that produces the lowest p -values is akin to p -hacking.

If the model is correctly specified then it is reasonable to choose d  =  J − 1, where J is the number of samples (or patients). To see why, notice that the mean expression in sample 1 ≤  j  ≤  J is

In particular, the J sample means lie on a (J − 1)-dimensional subspace of \(\mathbb{R}^G\). Under the assumption that the condition difference and the sample-level variability are larger than the error variance \(\sigma^2\), we should expect the first J − 1 PC vectors to capture all of the variance due to differences in sample means.

In practice, however, the model cannot be expected to be correctly specified. For this reason, we find that d = 20 is a reasonable choice when the number of samples is small (as is usually the case in scRNA-seq) and d = 50 for datasets with a large number of samples. This is in line with other single-cell methods, where the number of PCs retained is usually between 20 and 50.

Cell type annotation and “double dipping”

scDist takes as input an annotated list of cells. A common approach to annotate cells is to cluster them based on gene expression. Since scDist also uses the gene expression data to measure the condition difference, there are concerns associated with "double-dipping", or using the data twice. In particular, if the condition difference is very large and all of the data are used to cluster, it is possible that the cells in the two conditions would be assigned to different clusters. In this case scDist would be unable to estimate the inter-condition distance, leading to a false negative. In other words, the issue of double dipping could cause scDist to be more conservative. Note that the opposite problem occurs when performing differential expression between two estimated clusters; in this case, the p-values corresponding to genes will be anti-conservative (ref. 29).

To illustrate, we simulated a normalized count matrix with 4000 cells and 1000 genes such that there are two "true" cell types and a true condition distance of 4 for both cell types (Fig. S23a). To cluster (annotate) the cells, we applied k-means with various choices of k and compared results by taking the median inter-condition distance across all clusters. As the number of clusters increases, the median distance decays towards 0, which demonstrates that scDist can produce false negatives when the data are over-clustered (Fig. S23b). To avoid this issue, one possible approach is to begin by clustering the data from only one condition and then to assign cells in the other condition to the nearest centroid among the existing clusters (a sketch of this strategy is given below). When applied to the simulated data, this approach correctly estimates the condition distance even when the number of clusters k is larger than the true value.
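A minimal sketch of the cluster-then-assign strategy, assuming `pcsA` and `pcsB` are PC-score matrices (cells by dimensions) for the two conditions and `k` is the chosen number of clusters:

```r
# Cluster only condition A, then place each condition B cell into the nearest
# existing centroid, so the clustering never "sees" the condition difference.
km     <- kmeans(pcsA, centers = k, nstart = 20)
clustA <- km$cluster

nearest_centroid <- function(x, centers) {
  which.min(colSums((t(centers) - x)^2))  # squared distance to each centroid
}
clustB <- apply(pcsB, 1, nearest_centroid, centers = km$centers)
```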

On real data, one approach to identify possible over-clustering is to apply scDist at various cluster resolutions. We used the expression data from the small COVID-19 data (ref. 1) to construct a tree \({{\mathcal{T}}}\) with leaf nodes corresponding to the cell types in the original annotation provided by the authors (Fig. S24; see Appendix A for a description of how the tree is estimated). At each internal node \(v\in {{\mathcal{T}}}\), we applied scDist to the cluster containing all children of v. We can then visualize the estimated distances by plotting the tree (Fig. S24). Situations where the child nodes have a small distance but the parent node has a large distance could be indicative of over-clustering. For example, PB cells are almost exclusively found in cases (1977 cells in cases and 86 cells in controls), suggesting that it is reasonable to consider PB and B cells as a single cell type when applying scDist.

Feature importance

To better understand the genes that drive the observed difference in the CD14+ monocytes, we define a gene importance score. For \(1 \le k \le d\) and \(1 \le g \le G\), the k-th importance score for gene g is \(|U_{kg}|\,\beta_g\). In other words, the importance score is the absolute value of the gene's k-th PC loading times its expression difference between the two conditions. Note that the gene importance score is 0 if and only if \(\beta_g = 0\) or \(U_{kg} = 0\). Since the \(U_{kg}\) are fixed and known, significance can be assigned to the gene importance score using the differential expression method used to estimate \(\beta_g\).
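For example, with the d-by-G loading matrix `U` and a vector `beta_hat` of estimated per-gene expression differences (names hypothetical), the scores for dimension k are a single vectorized product:

```r
# Importance score for every gene on PC dimension k: |U[k, g]| * beta_hat[g].
importance_k <- abs(U[k, ]) * beta_hat
names(importance_k) <- rownames(Z)   # assumes genes are the row names of Z
```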

Simulated single-cell data

We test the method on data generated from model (6). To ensure that the "true" distance is D, we use the R package uniformly (ref. 30) to draw β from the surface of the sphere of radius D in \(\mathbb{R}^G\). The data in Figs. 1C and 3C are obtained by setting β = 0 and \(\sigma^2 = 1\) and varying \(\tau^2\) between 0 and 1.
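The same sphere sampling can be reproduced without the uniformly package by normalizing a standard Gaussian vector, which gives an identical distribution; a sketch with arbitrary choices of G and D:

```r
# Draw beta uniformly from the surface of the sphere of radius D in R^G.
G <- 1000
D <- 5                          # target "true" distance (hypothetical value)
beta <- rnorm(G)
beta <- D * beta / sqrt(sum(beta^2))
sum(beta^2)                     # equals D^2 by construction
```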

Weighted distance

By default, scDist uses the Euclidean distance D which treats each gene equally. In cases where a priori information is available about the relevance of each gene, scDist provides the option to estimate a weighted distance D w , where \(w\in {{\mathbb{R}}}^{G}\) has non-negative components and

The weighted distance can be written in matrix form by letting \(W\in \mathbb{R}^{G\times G}\) be a diagonal matrix with \(W_{gg} = w_g\), so that

Thus, the weighted distance can be estimated by instead considering the transformed model in which \(U\sqrt{W}\) is applied to each \(z_{ij}\). Once this transformed model is obtained, estimation and inference of \(D_w\) proceed in exactly the same way as in the unweighted case.
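A sketch of the weighted transformation, assuming `w` is the vector of non-negative gene weights, `U_mat` the d-by-G loading matrix, and `Z` the normalized genes-by-cells matrix (names hypothetical):

```r
# Applying U %*% sqrt(W) to each cell: sqrt(w) rescales every gene (row of Z),
# after which the per-dimension mixed-model fits proceed as in the unweighted case.
scores_w <- t(U_mat %*% (sqrt(w) * Z))   # cells x d weighted scores
```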

To test the accuracy of the weighted distance estimate, we considered a simulation in which each gene had only a 10% chance of having \(\beta_g \ne 0\) (in which case \(\beta_g \sim \mathcal{N}(0,1)\)). We then considered three scenarios: \(w_g = 1\) if \(\beta_g \ne 0\) and \(w_g = 0\) otherwise (correct weighting), \(w_g = 1\) for all g (unweighted), and \(w_g = 1\) randomly with probability 0.1 (incorrect weights). We quantified performance by taking the absolute error between \(\sum_g \beta_g^2\) and the estimated distance. Figure S3 shows that correct weighting slightly outperforms unweighted scDist, but random weights are significantly worse. Thus, the unweighted version of scDist should be preferred unless strong a priori information is available.

Robustness to model misspecification

The scDist model assumes that the cell-specific variance \(\sigma^2\) and sample-specific variance \(\tau^2\) are shared across genes. The purpose of this assumption is to ensure that the noise in the transformed model follows a spherical normal distribution. Violations of this assumption could lead to miscalibrated standard errors and hypothesis tests but should not affect estimation. To demonstrate this, we considered simulated data where each gene has \(\sigma_g \sim \mathrm{Gamma}(r, r)\) and \(\tau_g \sim \mathrm{Gamma}(r/2, r)\). As r varies, the quality of the distance estimates does not change significantly (Fig. S26).
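For reference, the gene-specific noise scales in this simulation can be drawn as follows (a sketch; the value of `r` is arbitrary):

```r
G <- 1000
r <- 2                                        # hypothetical concentration parameter
sigma_g <- rgamma(G, shape = r,   rate = r)   # cell-level noise scale per gene
tau_g   <- rgamma(G, shape = r/2, rate = r)   # sample-level noise scale per gene
```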

Semi-simulated COVID-19 data

COVID-19 patient data for the analysis were obtained from ref. 17, containing 1.4 million cells of 64 types from 284 PBMC samples collected from 196 individuals, including 171 COVID-19 patients and 25 healthy donors.

Ground truth

We define the ground truth as the cell-type specific transcriptomic differences between the 171 COVID-19 patients and the 25 healthy controls. Specifically, we used the following approach to define a ground truth distance:

For each gene g, we computed the log fold change \(L_g\) between COVID-19 cases and controls, with \(L_g = E_g(\mathrm{Covid}) - E_g(\mathrm{Control})\), where \(E_g\) denotes the log-transformed expression \(\log(1+x)\).

The ground truth distance is then defined as \(D = \sum_g L_g^2\).
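A sketch of this ground-truth computation, assuming `X` is a genes-by-cells count matrix for one cell type and `is_covid` flags cells from COVID-19 cases; averaging the log1p expression over cells is an assumption about how \(E_g\) is computed:

```r
# Mean log-transformed expression per condition, per gene.
Eg_covid   <- rowMeans(log1p(X[, is_covid]))
Eg_control <- rowMeans(log1p(X[, !is_covid]))

L <- Eg_covid - Eg_control        # per-gene log fold change
D_truth <- sum(L^2)               # ground-truth distance as defined above
```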

Subsequently, we excluded from further analysis any cell types not present in more than 10% of the samples. For true-negative cell types, we identified the top 5 with the smallest fold changes and a representation of over 20,000 cells within the entire dataset. When attempting similar filtering based on cell count alone, no cell types demonstrated a sufficiently large true distance; consequently, we chose the top four cell types with over 5000 cells as our true positives (Fig. S11).

Using the ground truth, we performed two separate simulation analyses:

1: Simulation analysis I (Fig. 5A, B): Using one half of the dataset (712,621 cells, 132 case samples, 20 control samples), we created 100 subsamples consisting of 5 cases and 5 controls. For each subsample, we applied both scDist and Augur to estimate the perturbation/distance between cases and controls for each cell type. We then computed the correlation between the ground truth ranking (ordering cell types by the sum of log fold changes on the whole dataset) and the ranking obtained by each method. For scDist, we restricted to cell types that had a non-zero distance estimate in each subsample, and for Augur we restricted to cell types that had an AUC greater than 0.5 (Fig. 5A). For Fig. 5B, we took the mean estimated distance across the subsamples in which the given cell type had a non-zero distance estimate, because in some subsamples a given cell type could be completely absent.

2: Simulation analysis II (Fig. 5C–F): We subsampled the COVID-19 cohort of 284 PBMC samples from 196 individuals (171 with COVID-19 infection and 25 healthy controls) to create 1,000 downsampled cohorts, each containing samples from 10 individuals (5 with COVID-19 and 5 healthy controls). Each sample in a downsampled cohort was selected at random from the original COVID-19 cohort, and the number of cells of each cell type was further downsampled. This downsampling procedure increases both cohort variability and cell-number variation.

Performance evaluation in subsampled cohorts: We applied scDist and Augur to each subsampled cohort, comparing the results for true-positive and false-positive cell types. We partitioned the sampled cohorts into 10 groups based on cell-number variation, defined as the number of cells in the sample with the highest cell count for the false-negative cell types divided by the average number of cells across cell types. This procedure highlights the vulnerability of computational methods to cell-number variation, particularly for negative cell types.

Analysis of immunotherapy cohorts

Data collection

We obtained single-cell data from four cohorts 2 , 23 , 24 , 25 , including expression counts and patient response information.

Pre-processing

To ensure uniform processing and annotation across the four scRNA-seq cohorts, we analyzed CD45+ cells (removing CD45− cells) in each cohort and annotated cells using Azimuth (ref. 31) with the reference provided for CD45+ cells.

Model to account for cohort and sample variance

To account for cohort-specific and sample-specific batch effects, scDist modeled the normalized gene expression as:

Here, Z represents the normalized count matrix; X denotes the binary indicator of condition (responder = 1, non-responder = 0); γ and ω are cohort- and sample-level random effects; and (1 | γ:ω) models the nested effects of samples within cohorts. The inference procedure for the distance, its variance, and significance in the multi-cohort model is analogous to the single-cohort model.
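In lme4 syntax, the nested sample-within-cohort structure for one transformed dimension can be written as follows (a sketch; the variable names are hypothetical):

```r
library(lme4)

# z: one transformed expression dimension; X: responder indicator (1/0);
# cohort and sample are factors, with samples nested within cohorts.
fit <- lmer(z ~ X + (1 | cohort) + (1 | cohort:sample), data = df)
fixef(fit)["X"]   # condition (responder vs. non-responder) effect
```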

We estimated the signature in the NK-2 cell type using differential expression between responders and non-responders. To account for cohort-specific and patient-specific effects in the differential expression estimation, we employed the linear mixed model described above for estimating distances, performing inference for each gene separately. The coefficient of X inferred from the linear mixed model was used as the estimate of differential expression:

The terms Z, X, γ, and ω are defined as above.

Bulk RNA-seq cohorts

We obtained bulk RNA-seq data from seven cancer cohorts 32 , 33 , 34 , 35 , 36 , 37 , 38 , comprising a total of 789 patients. Within each cohort, we converted counts of each gene to TPM and normalized them to zero mean and unit standard deviation. We collected survival outcomes (both progression-free and overall) and radiologic-based responses (partial/complete responders and non-responders with stable/progressive disease) for each patient.

Evaluation of signature in bulk RNA-seq cohorts

We scored each bulk transcriptome (sample) for the signature using the strategy described in ref. 39. Specifically, the score was defined as the Spearman correlation between the normalized expression and the differential expression in the signature. We stratified patients into two groups using the median score. Kaplan–Meier plots were generated using these stratifications, and the significance of survival differences was assessed using the log-rank test. To demonstrate the association of signature levels with radiological response, we plotted signature levels separately for non-responders, partial responders, and responders.

Evaluation of the Augur signature in bulk RNA-seq cohorts

A differential signature was derived for Augur ’s top prediction, plasma cells, using a procedure analogous to the one described above for scDist . This plasma signature was then assessed in bulk RNA-seq cohorts following the same evaluation strategy as applied to the scDist signature.

Statistics and reproducibility

No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Reporting summary

Further information on research design is available in the  Nature Portfolio Reporting Summary linked to this article.

Data availability

Table  1 gives a list of the datasets used in each figure, as well as details about how the datasets can be obtained.  Source data are provided with this paper.

Code availability

scDist is available as an R package and can be downloaded from GitHub 40 : github.com/phillipnicol/scDist . The repository also includes scripts to replicate some of the figures and a demo of scDist using simulated data.

Wilk, A. J. et al. A single-cell atlas of the peripheral immune response in patients with severe COVID-19. Nat. Med. 26, 1070–1076 (2020).


Yuen, K. C. et al. High systemic and tumor-associated IL-8 correlates with reduced clinical benefit of PD-L1 blockade. Nat. Med. 26, 693–698 (2020).

Crowell, H. L. et al. muscat detects subpopulation-specific state transitions from multi-sample multi-condition single-cell transcriptomics data. Nat. Commun. 11, 6077 (2020).


Helmink, B. A. et al. B cells and tertiary lymphoid structures promote immunotherapy response. Nature 577 , 549–555 (2020).

Zhao, J. et al. Detection of differentially abundant cell subpopulations in scRNA-seq data. Proc. Natl. Acad. Sci. 118, e2100293118 (2021).

Dann, E., Henderson, N. C., Teichmann, S. A., Morgan, M. D. & Marioni, J. C. Differential abundance testing on single-cell data using k-nearest neighbor graphs. Nat. Biotechnol. 40 , 245–253 (2022).


Burkhardt, D. B. et al. Quantifying the effect of experimental perturbations at single-cell resolution. Nat. Biotechnol. 39 , 619–629 (2021).

McInnes, L., Healy, J. & Melville, J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv https://arxiv.org/abs/1802.03426 (2018).

Skinnider, M. A. et al. Cell type prioritization in single-cell data. Nat. Biotechnol. 39 , 30–34 (2021).

Zimmerman, K. D., Espeland, M. A. & Langefeld, C. D. A practical solution to pseudoreplication bias in single-cell studies. Nat. Commun. 12 , 1–9 (2021).


Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

Townes, F. W., Hicks, S. C., Aryee, M. J. & Irizarry, R. A. Feature selection and dimension reduction for single-cell RNA-seq based on a multinomial model. Genome Biol. 20, 1–16 (2019).

Hafemeister, C. & Satija, R. Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression. Genome Biol. 20 , 1–15 (2019).

Stephens, M. False discovery rates: a new deal. Biostatistics 18 , 275–294 (2017).


Zheng, G. X. et al. Massively parallel digital transcriptional profiling of single cells. Nat. Commun. 8 , 14049 (2017).

Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7 , 1141 (2018).

Ren, X. et al. COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913 (2021).

Galati, D., Zanotta, S., Capitelli, L. & Bocchino, M. A bird's eye view on the role of dendritic cells in SARS-CoV-2 infection: perspectives for immune-based vaccines. Allergy 77, 100–110 (2022).

Pérez-Gómez, A. et al. Dendritic cell deficiencies persist seven months after SARS-CoV-2 infection. Cell. Mol. Immunol. 18, 2128–2139 (2021).

Upadhyay, A. A. et al. TREM2+ and interstitial macrophages orchestrate airway inflammation in SARS-CoV-2 infection in rhesus macaques. bioRxiv https://www.biorxiv.org/content/10.1101/2021.10.05.463212v1 (2021).

Wang, S. et al. S100A8/A9 in inflammation. Front. Immunol. 9, 1298 (2018).

Mellett, L. & Khader, S. A. S100A8/A9 in COVID-19 pathogenesis: impact on clinical outcomes. Cytokine Growth Factor Rev. 63, 90–97 (2022).

Luoma, A. M. et al. Tissue-resident memory and circulating T cells are early responders to pre-surgical cancer immunotherapy. Cell 185, 2918–2935 (2022).

Yost, K. E. et al. Clonal replacement of tumor-specific T cells following PD-1 blockade. Nat. Med. 25, 1251–1259 (2019).

Sade-Feldman, M. et al. Defining T cell states associated with response to checkpoint immunotherapy in melanoma. Cell 175 , 998–1013 (2018).

Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2, 559–572 (1901).

Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67 , 1–48 (2015).

North, B. V., Curtis, D. & Sham, P. C. A note on the calculation of empirical p values from Monte Carlo procedures. Am. J. Hum. Genet. 71 , 439–441 (2002).

Neufeld, A., Gao, L. L., Popp, J., Battle, A. & Witten, D. Inference after latent variable estimation for single-cell RNA sequencing data. arXiv https://arxiv.org/abs/2207.00554 (2022).

Laurent, S. uniformly: uniform sampling. R package version 0.2.0 https://CRAN.R-project.org/package=uniformly (2022).

Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184 , 3573–3587 (2021).

Mariathasan, S. et al. TGFβ attenuates tumour response to PD-L1 blockade by contributing to exclusion of T cells. Nature 554, 544–548 (2018).

Weber, J. S. et al. Sequential administration of nivolumab and ipilimumab with a planned switch in patients with advanced melanoma (CheckMate 064): an open-label, randomised, phase 2 trial. Lancet Oncol. 17, 943–955 (2016).

Liu, D. et al. Integrative molecular and clinical modeling of clinical outcomes to PD1 blockade in patients with metastatic melanoma. Nat. Med. 25, 1916–1927 (2019).

McDermott, D. F. et al. Clinical activity and molecular correlates of response to atezolizumab alone or in combination with bevacizumab versus sunitinib in renal cell carcinoma. Nat. Med. 24 , 749–757 (2018).

Riaz, N. et al. Tumor and microenvironment evolution during immunotherapy with nivolumab. Cell 171 , 934–949 (2017).

Miao, D. et al. Genomic correlates of response to immune checkpoint therapies in clear cell renal cell carcinoma. Science 359 , 801–806 (2018).

Van Allen, E. M. et al. Genomic correlates of response to CTLA-4 blockade in metastatic melanoma. Science 350, 207–211 (2015).

Sahu, A. et al. Discovery of targets for immune–metabolic antitumor drugs identifies estrogen-related receptor alpha. Cancer Discov. 13 , 672–701 (2023).

Nicol, P. scDist. https://doi.org/10.5281/zenodo.12709683 (2024).


Acknowledgements

P.B.N. is supported by NIH T32CA009337. A.D.S. received support from R00CA248953, the Michelson Foundation, and was partially supported by the UNM Comprehensive Cancer Center Support Grant NCI P30CA118100. We express our gratitude to Adrienne M. Luoma, Shengbao Suo, and Kai W. Wucherpfennig for providing the scRNA data 23 . We also thank Zexian Zeng for assistance with downloading and accessing the bulk RNA-seq dataset.

Author information

Authors and affiliations

Harvard University, Cambridge, MA, USA

Phillip B. Nicol & Danielle Paulson

University of California San Diego School of Medicine, San Diego, CA, USA

Dana-Farber Cancer Institute, Boston, MA, USA

X. Shirley Liu & Rafael Irizarry

University of New Mexico Comprehensive Cancer Center, Albuquerque, NM, USA

Avinash D. Sahu


Contributions

P.B.N., D.P., G.Q., X.S.L., R.I., and A.D.S. conceived the study. P.B.N. and A.D.S. implemented the method and performed the experiments. P.B.N., R.I., and A.D.S. wrote the manuscript.

Corresponding authors

Correspondence to Rafael Irizarry or Avinash D. Sahu .

Ethics declarations

Competing interests

X.S.L. conducted the work while being on the faculty at DFCI, and is currently a board member and CEO of GV20 Therapeutics. P.B.N., D.P., G.Q., R.I., and A.D.S. declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary information (including a peer review file, a reporting summary, and source data) is available for this article.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


Cite this article

Nicol, P.B., Paulson, D., Qian, G. et al. Robust identification of perturbed cell types in single-cell RNA-seq data. Nat Commun 15 , 7610 (2024). https://doi.org/10.1038/s41467-024-51649-3


Received: 14 December 2023

Accepted: 09 August 2024

Published: 01 September 2024



People-environment relations following COVID-19 pandemic lifestyle restrictions: a multinational, explorative analysis of intended biophilic design changes

  • Open access
  • Published: 02 September 2024
  • Volume 5, article number 229 (2024)


Kalterina Shulla, Bernd-Friedrich Voigt, Salim Lardjane, Kerstin Fischer, Piotr Kędzierski, Giuseppe Scandone & Thomas Süße

The study analyzes the consequences of the COVID-19 pandemic restrictions for human–environment relations through the lens of biophilic design. The mixed-method quantitative and qualitative explanatory research combines contextual and personal variables, including, among others, country, age group, gender, overcrowding, time spent outside, access to nature/food and exposure to biophilic elements during and after the lockdown. The results indicate that the psychological pressure on individuals caused by the pandemic restrictions imposed in early 2020 triggered changes in human–environment relations. More precisely, our comparative analysis of six European countries (Italy, Germany, Poland, Spain, Denmark and Sweden) indicates that people-environment relations do not depend on the objective severity of country-wise restrictions, but rather on individual perceptions of these restrictions. The results address the lack of research on the role of biophilic design in understanding and enhancing human–environment relations during the COVID-19 pandemic restrictions and thereafter.


1 Introduction

The power of nature to support physical and mental well-being, and its healing role in stressful conditions, is undeniable, as it fulfills human aesthetic, intellectual, cognitive and spiritual cravings [1]. Exposure to real or simulated natural views can quickly trigger restorative activity in the brain and reduce stress levels [2]. Contemplating a nature-integrated urban environment can enhance positive emotions [3, 4], just as deprivation from it can worsen negative states [5]. In early 2020, the COVID-19 pandemic imposed worldwide restrictions that were unique because of their variety and differing severity, which ultimately resulted in graduated states of psychological pressure on individuals [6, 7]. However, when people are forced to cope with crises in unusual circumstances, changes are triggered in their patterns of living and working toward a more sustainable and resilient lifestyle [8]. "Worry" can divert life priorities, but a strong self-concept of nature can serve as a buffer that moderates the impact on environmental values [9], enabling "salutogenic" experiences out of stress and psychological strain even in extreme environments [10]. The restrictions of the COVID-19 global pandemic drew attention to biophilic design, as it embraces elements of direct and indirect experience of nature.

As personal growth is attributed to societal dynamics and changes in norms [11], the way people use and perceive the environment can be encouraged or limited by such norms and cultural contexts [12]. Ironically, despite its far-reaching negative effects, the pandemic also created a new environment for self-reflection and for changes in personal perceptions of, and actions toward, one's natural surroundings [13] and in individuals' lifestyle choices, such as food, housing and mobility; factors that are largely within individual control [14]. While the ecological footprint and environmental impact of the crisis have been widely considered in research on sustainable design practices [15], biophilic design (BD) encompasses the mutual benefits of connecting with nature for both humans (physiological and psychological benefits) and the environment [16]. Therefore, this study aims to understand the effects of the COVID-19 pandemic on changes in people-environment relations, considering the objective as well as the perceived severity of lifestyle restrictions. Beyond this, the research aims to depict a pattern of predictive variables regarding the likelihood of integrating BD in future life. We do so by combining a country-comparative research approach with an explorative quantitative and qualitative analysis of intended BD changes.

2 Conceptual background

The term ‘biophilia’ (love of life), composed of the ancient Greek words for “life” (bio) and “love” (philia), describes harmonious relationships between humans and the biosphere [ 17 ]. The term was first used by Erich Fromm in “The Heart of Man” (1964) and later by Edward Wilson in “Biophilia” [ 18 ]. The individual’s physiological and psychological response to nature enables the effects of BD elements such as the direct experience of nature in the built environment (natural light, air, plants, animals, water, landscapes), the indirect experience of nature (contact with the representation or image of nature, natural materials, etc.) and the experience of space and place (spatial features of prospect refuge, etc.) [ 19 , 20 ].

The human response to design stimuli allows BD elements to improve quality and sustainability by enhancing health and well-being, productivity, biodiversity, circularity, and resilience [21]. The green building movement in the early 1990s reinforced the link between improved environmental quality and worker productivity [22] through the use of BD to connect with the indoor environment. Additional benefits include addressing workplace stress, student performance, patient recovery, and community cohesiveness [23] and improving well-being in prisons [24]. The BD elements in the landscape (even those not perceived as such) enable the incorporation of diverse strategies into the built environment [25]. In urban settings, these elements, the new "Hanging Gardens of Babylon", are indicators of sustainability and resilience. When people are free to choose their home or workplace environment, the choice is often dominated by a viewpoint with a generous prospect, an elevated position, open, savanna-like terrain, proximity to a body of water, and so on [26].

BD research is mainly related to two theoretical concepts from environmental psychology. The first is Stress Reduction Theory (SRT) [27, 28], which explains the extent to which contemplating nature can trigger restorative activity in the brain, which in turn is responsible for reduced stress levels and positive emotions. The second explanatory concept comes from Attention Restoration Theory (ART) [29], which states that a lack of concentration as well as mental fatigue, which can be attributed to a prolonged span of direct attention, can be positively influenced by a visual or physical stay in nature, and that the increase in concentration can be achieved through restorative processes with less energy-draining attention [30]. When a person faces an unpleasant and stressful change in their person-environment relation because of a perceived external behavioral control, patterns of BD (see Annex 2) and biophilia values [31] can trigger individual restorative responses [10]. These responses might ultimately result in an adjustment of the environmental surroundings, or at least enhance a person's motivation to do so by affecting the likelihood of using BD. Mindsponge Theory [32] conceptualizes this relation of perceived behavioral control in nature and models it with intentions towards behavior change [33]. A crucial part of this systemic conceptualization of the person-environment relation is the element of "perception of external information by the sensory systems, such as visual or auditory information" [34].

Literature shows that pandemic restrictions have divergent but socioeconomically moderate psychological effects (either positively or negatively related to states of stress) and that enforced restrictions can be perceived differently (i.e., at the individual level). In addition, deprivation from one of the domains can have such great importance that it can dominate the totality of the measures and, as a consequence, can result in a perceived stronger severity of the measures despite the moderate or weak objective status of the country [ 35 ]. The severe restrictions imposed in Europe and all over the world (although differing across countries), especially during the first wave of the pandemic (March–June 2020), limited life choices [ 36 , 37 ]. These restrictions were accompanied by psychological distress and a decrease in psychological well-being in the general public [ 38 ], among others, due to limited access to physical activity, lack of blue/green landscapes, views of nature from home [ 39 ] and remote interactions, which caused loneliness, especially for women and younger adults [ 40 ]. During this period in Italy, the lack of adequate space, terraces and gardens resulted in increased stress and aggressiveness [ 41 ], where the correlation with the “home satisfaction” factor in those conditions was related to spatial features of adequacy, flexibility, and crowding [ 42 ]. As human risk perception can lead to immediate action, in France, hours and days before the lockdown, people moved from their homes to other places, closer to family, or with better living conditions in terms of size, crowding, landscape, etc. [ 43 ]. Additional challenges in the living environment were also due to the necessity of adapting to working from home [ 44 , 45 ]. During the first wave, for instance, more than 60% of the workers in Germany were obligated to work from home, confronting the lack of a separable home-office working space and triggering a large-scale invasion of work into the private sphere [ 46 ].

Related post-pandemic research has analyzed the role of biophilic features in recovery from COVID-19. Afacan (2021) explores the role of biophilic design in enhancing psychological resilience during the pandemic, in relation to recovery, tension, mood, depression and anger [47]. Furthermore, integrating natural elements into both residential and public spaces, especially in times of crisis, can significantly improve mental and physical health and foster a sense of community and connection [48]. BD principles are vital for enhancing post-pandemic living spaces through maximizing natural light and ventilation, incorporating plants and green spaces, using natural materials, and designing flexible, multi-functional spaces. These approaches not only create aesthetically pleasing environments but also support well-being and sustainability, making living spaces more adaptable and resilient to future crises [49]. Incorporating natural elements into architectural design not only creates aesthetically pleasing environments but also encourages deeper connections with nature, leading to healthier, more resilient living spaces and better mental and physical health [50]. Furthermore, investigations have been conducted on the relevance of various influential factors for the efficiency and effectiveness of working from home and for physical and mental well-being [51, 52]. There is a need for coordinated cross-disciplinary research to address COVID-19's mental health impacts and to understand the pandemic's psychological effects during and after the pandemic [53]. However, the role of BD as an indicator of the enhanced connection between nature and humans triggered by lived experiences during the pandemic is under-researched. This study aims to fill this gap by analyzing people-environment relations following the COVID-19 pandemic lifestyle restrictions through a multinational, explorative analysis of intended BD changes.

3 Systematization of restrictive measures during the pandemic in Italy, Germany, Poland, Spain, Denmark and Sweden

In most countries, the restrictive measures taken during the COVID-19 pandemic consisted of establishing lockdowns, declaring states of emergency, and banning outside activities, border crossings, travel/international flights, and events. The Oxford COVID-19 Government Response Tracker (OxCGRT) defines the stringency of the measures in eight domains: school and workplace closings, canceling public events, restrictions on gathering size, closing public transport, stay-at-home requirements, and restrictions on internal movement and international travel [54, 55]. These restrictions were taken as the basis for defining the comparative groups, contrasting case/country selections: (1) countries that experienced strong/moderate restrictions (Italy, Spain, Germany and Poland) and (2) countries with relatively weak restrictions (Denmark and Sweden). Countries are used as proxies, not considering internal differences (e.g., in November 2020 Italy was divided into three zones (red, orange, and yellow) depending on the severity of the outbreak, with different restrictions applied in each zone).

The six selected countries were affected differently by the pandemic, as reflected by the varying severity of the measures taken. During the first wave of the pandemic, a state of emergency was declared in Italy, Spain, and Denmark. A lockdown was implemented in Italy, Spain and Poland, and partially in Germany, while Denmark and Sweden had no national lockdowns (see Table 1 below for an overview of the restrictions in the above domains, plus the lockdown status and the state of emergency in the six countries). Sections 3.1 and 3.2 display detailed illustrations of the restrictive measures in the two groups.

3.1 Countries with strong/moderate restrictions: Italy, Spain, Germany and Poland

Italy, one of the first countries in Europe to be heavily impacted by the COVID-19 pandemic, took a series of strict measures to curb the spread of the virus, with a significant impact on citizens, placing it among the group of countries with more self-protective measures [56]. The Italian government declared a state of emergency on January 31st, 2020 (which lasted until 1st of April 2022), and a nationwide lockdown on March 9th, 2020, closing all nonessential businesses and allowing people to leave their homes only for essential reasons, such as buying groceries or going to the doctor [57]. Face masks were mandatory in all public spaces, and social distancing was enforced. In the summer of 2020, restrictions were gradually lifted, and people were allowed to travel within the country for tourism purposes. However, new restrictions were imposed in the fall due to a rise in cases; nevertheless, these restrictions were less severe, although social distancing was still enforced. Vaccination campaigns were underway, and people who were fully vaccinated had more freedom, such as attending events and travelling abroad. The pandemic resulted in an increase in remote work, with many companies allowing their employees to work from home [58]. This consequently led people to move out of cities toward smaller towns and villages with more affordable housing and space. The pandemic has accelerated the trend of suburbanization in Italy, with more people looking for larger homes with outdoor spaces [59]. According to the Digital Innovation Observatories of the School of Management of the Politecnico di Milano, in 2022 there were approximately 3.6 million remote workers, almost 500 thousand fewer than in 2021, with a decrease in particular in the Public Administration (PA) and Small and Medium Enterprises (SMEs); however, there is slight but constant growth in large companies, which, with 1.84 million workers, accounted for approximately half of the total smart workers. Despite this, there is increased awareness and action in organizations to create workspace environments that motivate and give meaning to work in the office. Approximately 52% of large companies, 30% of SMEs and 25% of PAs have already carried out interventions to modify the work environment or are doing so in recent months. In the future, these initiatives are planned or under evaluation for 26% of large companies, 21% of public administrations and 14% of SMEs [60]. Furthermore, during the pandemic, parks and public gardens became more important as places for people to exercise and relax while maintaining social distance. A study conducted in Italy during the first COVID-19 pandemic wave (April–May 2020) highlighted the fact that restrictions influenced citizens' perceptions of urban green spaces, with a consequent increase in general interest in parks and public gardens [61].

Spain was one of the European countries with the highest incidence during the first wave [ 62 ], and in the global context, Spain experienced one of the worst situations [ 63 ]. The government imposed a nationwide lockdown by mid-March 2020, which included the prohibition of nonessential transit and blanket recommendations for WFH. Easing measures started later in May through several phases. Having high heterogeneity across the territory, the lifting of limitations would progress through the phases as rearranged during the process, making the progressive lifting of the restrictions challenging due to the highly decentralized system [ 64 ]. First, outside exercise was allowed, but borders remained closed, and no travel between different territorial units was permitted. Face masks were highly recommended both on public transport and outside. Afterward, shops, food markets and restaurants reopened with social distancing and reduced capacity, while public transit reopened with full service but reduced passenger numbers. The country entered a ‘new normality’ in late June, where travel between provinces was also allowed again. These restrictions were difficult to implement because of the conflicted political environment, which resulted in some territorial administrations taking preliminary measures at the subnational level in an uncoordinated manner [ 65 ]. Apart from the economic impact, the lockdown measures also had a significant social impact in Spain. Official data indicate an increase in gender-based violence (a 48% increase in calls for gender violence helplines during the first weeks of April 2020 compared to before the lockdown) [ 66 ]. Regarding WFH in Spain, nearly 83% of professionals were not granted the opportunity to WFH in 2020, whereas only 9.8% were telecommuting. In 2021, however, there was an increase in home office use to 25.2%. During the first month of the pandemic, there was a 38% reduction in physical activity [ 67 ].

Germany took a medium level of protective measures during the pandemic [56]. The first contact restrictions were announced as early as March 2020, followed by restrictions on travel and the closing of small shops and schools. In April, the obligation to remain in quarantine for 2 weeks when returning from another country and the recommendation to wear a face mask were lifted. Soon after, face masks were required for public transportation. In May 2020, schools and small shops slowly opened again. Contact restrictions depended on the number of cases in the district. In October, increasing lockdowns and contact restrictions were introduced across the country, which lasted until January 2021. In August 2021, shops, restaurants, etc., increasingly opened for people who had gone through an infection, had been vaccinated twice or had a recent negative test. In December, these opportunities remained open only to people who were fully vaccinated or had recovered. The pandemic intensified suburbanization in Germany during 2020, a process driven by internal migration that had previously been at its lowest rates since the mid-1990s [68]. The green spaces attached to residential buildings, mostly designed with "semi-public access", were appreciated by residents as refugia in challenging times and were used more actively than in prepandemic times, especially when sitting in parks was not allowed [69]. During the pandemic in Hamburg, larger numbers of visitors were recorded in protected nature areas and local nature reserves, to an extent that caused considerable problems for the wildlife there. For instance, in the nature reserve Duvenstedter Brook, many people used the nature park more as an urban recreational area, chasing deer or playing badminton on protected stretches of heath.

Poland experienced a relatively mild first phase of the pandemic compared to other European countries, with the first case identified a month later than in Germany and France. The government declared a lockdown and enforced self-isolation measures (24 March 2020) and even applied measures that were not yet recommended by international institutions; for example, Poland was the first EU country to shut down its external borders, including those with other EU Member States. Factors slowing the progression of the initial phases of the pandemic include the relatively younger population compared to the most affected European countries, the larger share of the population living in rural areas and the low rate of domestic and international mobility among the Polish people [70]. The impact of COVID-19 also resulted in changes in real estate and suburbanization in Poland, driving a wave of people to buy property (houses with land plots) to escape the dread of living in apartment buildings, either occasionally or permanently [71]. In addition, the pandemic prompted many measures by the Polish government related to the economy, taxes, employment and extraordinary changes in court proceedings and the system of justice.

3.2 Countries with weak restrictions, Denmark and Sweden

In Denmark, the anti-COVID-19 measures taken were comparatively brief [72]. Denmark was indeed in the group of countries with fewer self-protective measures [56]. Starting in the middle of March 2020, schools, public institutions, hairdressers, restaurants, shopping malls, etc., were closed down, but on April 20th, 2020, they were opened again, with fitness centers and swimming pools being the last to reopen on June 1st. Over the course of the summer, face masks were first recommended and then required, first on public transport and later in restaurants, shops and public institutions. In northern Jutland, seven municipalities were completely isolated from their surroundings due to an outbreak of a new variant on some mink farms. Apart from those restrictions, no mobility restrictions were imposed within the country. This changed shortly before Christmas 2020, when schools, restaurants, public institutions, theatres, etc., were closed for almost two months; starting on February 28th, the country began opening everything again. On May 21st, almost all restrictions were lifted. Furthermore, many Danes have their own houses, gardens and/or summer houses. In 2020, 2.7 million Danes lived in detached houses, whereas 1.6 million lived in multiunit buildings that they did not own; this relationship did not change between 2021 and 2022 [73]. Given the short period in which public life was restricted and the access to nature available to a large proportion of the population, it can be expected that the impact of the COVID-19 measures on the Danish population's attitude towards the environment is not very pronounced.

Sweden chose a different strategy during the pandemic, based mainly on voluntary measures, citizen behavior and recommendations rather than restrictions, and a complete lockdown was never implemented (Sweden country snapshot: public health agencies and services in the response to COVID-19, World Health Organization, WHO). No state of emergency was declared because the Swedish Constitution does not provide for a state of emergency during a public health crisis. This less rigid approach focused more on mitigation measures for slowing, but not stopping, the pandemic and relied on existing high levels of institutional and interpersonal trust. The affected geographical regions or households were not placed under enforced quarantine, and face masks were not recommended outside health care [74]. Recommendations consisted mainly of "staying at home even with the slightest symptom of an infection, physical distancing, enhanced hygiene measures, avoiding public transportation, and working from home if possible" [75]. Physical distancing was recommended in public spaces but mandatory in bars, restaurants and at events. A maximum of 50 people was allowed to gather. In some opinions, this approach was considered to have caused less serious consequences than the severe policies used in most countries [76]. WFH, which accounted for approximately 40% of the total workforce in Sweden independent of previous work experience, influenced the establishment of new habits [77]. Studies suggest that workload, performance and well-being decreased during the pandemic [78].

Assuming positive effects on well-being through a stronger connection between individuals and the natural environment, while also considering the unusual circumstances of the world pandemic, this study addresses individuals' likelihood of using BD after the pandemic. More precisely, the research interest extends to identifying contextual variables (country, overcrowding, time spent outside, and access to nature/food) and personal variables (age group and sex) influencing the likelihood of using BD, by focusing on the following:

Individuals’ exposure (during and after the lockdown) to several BD elements intentionally or unintentionally, including indoors (color, water, air, sunlight, plants, animals, natural materials, views and vistas, façade greening, geology and landscape, habitats and ecosystems) and outdoors (location, green neighborhood, wide prospect, proximity to natural resorts, etc.), as reported through a questionnaire to test whether the severity of the lockdown restriction of the COVID-19 pandemic fostered stronger people-environment relations, as valued by the likelihood of using BD elements.

The role of the context of (strong/weak) restrictions in several European countries (Italy, Germany, Poland, Spain, Denmark and Sweden) to test whether people-environment relations differ according to the objective and/or perceived severity of the measures in a country context.

The study design was exploratory, mixed-method, cross-sectoral and comparative. The data were collected through a survey directed to European countries via an online Google form conducted from 30 January to 28 February 2023 (see Annex 1). Following a random, uncontrolled sampling strategy, the survey was shared with learning networks such as the Bosch Alumni Network (an international network across 140 countries currently hosting more than 8000 members), the network of European RCEs (Regional Centers of Expertise on Education for Sustainable Development), and the COST (European Cooperation in Science and Technology) action networks of Indoor Air Pollution and Circular city, as well as with practitioners of several universities in Europe. The questions were closed and open (with the purpose of revealing the unexpected elements of change) and organized into four sections: (1) background questions containing variables such as age (different generations have different attitudes and approaches to restrictions), gender and countries (which are used as proxies); (2) questions about exposure to BD elements (color, water, air, sunlight, plants, animals, views and vistas, geology and landscape, habitats and ecosystems) indoors and outdoors, before and after the lockdown; based on the indoor and outdoor elements of the BD [ 80 , 81 , 82 ], from the framework of 14 Biophilic Patterns [ 31 ]; (3) questions about flexibility and adaptation of the living environment after the lockdown concerning the elements of BD; and (4) additional information on any major changes incentivized by the lockdown restrictions concerning lifestyle and wellbeing.

The data were processed through mixed methods, namely, descriptive statistics, statistical model building and testing and a thematic analysis [ 83 ] of the qualitative data using affinity diagramming (Lucero) [ 84 ]. Table 2 provides an overview of the methodological approach.

To facilitate quantitative correspondence analyses [88, 89], numerical variables were recoded into three modalities (less, same, more) or (low, intermediate, high). Data visualizations were derived as two-dimensional planes using the FactoMineR package [85] of the R statistical software [90]. Finally, a logistic model of the included variables was estimated to explain the likelihood of including BD, using R statistical software.
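A hedged sketch of this pipeline; the data frame `survey`, its column names, and the use of multiple correspondence analysis (MCA) as the specific correspondence method are illustrative assumptions rather than the authors' exact code:

```r
library(FactoMineR)

# Recoded categorical survey variables and a binary outcome: likelihood of
# including biophilic design (include_bd: 1 = likely, 0 = not likely).
vars <- c("country", "time_outside", "overcrowding", "access_nature")
mca  <- MCA(survey[, vars], graph = FALSE)
plot(mca)                                  # two-dimensional factor map

fit <- glm(include_bd ~ country + time_outside + overcrowding + access_nature,
           data = survey, family = binomial)
summary(fit)                               # logistic model of the likelihood of BD
```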

The qualitative data were analyzed using affinity diagrams [84] to identify the categories that emerged. Affinity diagramming is a variant of thematic analysis [83]. In this process, all the comments were printed in different colors depending on the country and were clustered and labeled in several iterative steps so that the categories emerged bottom-up: (a) initial familiarization with the data, (b) creating initial codes, (c) collating codes with supporting quotes, (d) grouping codes into themes, (e) reviewing and revising themes, and (f) writing up the narrative based on the emerging categories. This thematic analysis process was carried out using the affinity mapping technique by creating large visual representations of the data points (chart making and 'walking the wall'). Affinity diagrams make it possible to identify patterns in participants' answers, illustrating which consequences of the restrictions for their lifestyles the participants themselves foregrounded. To quantify the comments, the instances in each category per country were counted and divided by the total number of comments for that country, in line with the recommendations provided by Lucero [84]. For instance, seven Swedish participants made a comment that reported a change toward a healthier lifestyle, which corresponds to 16.7% of the total number of comments (42) made by the Swedish participants.

5.1 General descriptive analyses

The 403 participants in the survey were mainly from European Union countries and the United Kingdom (89%), such as Italy (17%), Germany (16%), Sweden (14%), Poland (11%), Denmark (9%), Spain (9%), the UK (3%), France (2%), and other EU countries (8%; Czech Republic, Belgium, Greece, Romania, Bulgaria, The Netherlands, Portugal and Lithuania). The rest were from EU neighboring countries (9%; Albania, Serbia, Bosnia-Herzegovina, Belarus, Moldova and Turkey) and from other countries in the world (2%; United States of America, Canada, Cameroon, Jordan, Kenya and Saudi Arabia). Ninety percent of the respondents had graduate or postgraduate education in various fields. The majority of respondents were born in the 1990s and 1980s (37% and 27%, respectively); the rest were born in the 1970s (16.5%), 1960s (9.5%), 2000s (6%) and 1950s (2%). A total of 58% of the respondents were female, 41% were male, and 1% identified as other.

The perceived severity of the restrictions reported by the respondents corresponds with the objective severity of the restrictions in Italy and Spain (strong), in Germany and Poland (moderate), and in Sweden (weak); for the Danish participants, the restrictive measures were perceived as moderate or strong by the majority of respondents, in contrast with the country's weak objective status. Table 3 displays objective restrictions (based on the criteria followed by the Oxford COVID-19 Government Response Tracker, as also displayed in Table 1) and subjective restrictions as perceived by the respondents of the six countries. Overall, most of the 403 participants in this survey perceived the restrictive measures taken in their countries as strong or moderate.

The descriptive statistics revealed that the most influential variables were (1) overcrowding/limited space discomfort (58%), (2) difficulties working (50%), and (3) difficulties accessing green spaces (33%). Although no significant limits were reported for the choice of food (only 10%), the respondents reported changes in their nutrition after the pandemic related to the use of regional products (52.7%), switching to organic products (48.7%) and growing their own vegetables through urban gardening or farming (43.3%). Approximately 80% of the respondents considered visual and nonvisual connections with nature after the pandemic to be very important.

A comparison of the time spent outside in nature "during" and "after" the pandemic with that "before" the pandemic revealed that 43% of the respondents spent less time outside "during" the pandemic, while 60% of the participants spent more time outside "after" it. One-quarter of the participants reported the same amount of time "before" and "after" the pandemic.

Figure  1 shows the “Likelihood of including Biophilic design” in relation to the “Time spent outside”, indicating that this is more likely for respondents who have spent “more” or “less” time outside. The graph was generated using the data from the six selected countries: Denmark, Germany, Italy, Poland, Spain, and Sweden.

figure 1

Likelihood of including biophilic design in relation to “time spent outside” during the pandemic

Figure  2 shows the exposure to BD in the living environment before and after the pandemic, specifically to the following elements: balcony/terrace; private garden/common garden; green roof/façade; views and vistas from home, green or blue; plants/vegetation growing in home gardens/roofs/vases; glass surfaces, sunlight illumination (dynamic & diffuse light); orientation, ventilation, thermal and airflow variability; natural materials (natural wood grains; leather; stone, fossil textures; bamboo, rattan, dried grasses, cork, organic palette). There were no major changes in the specific elements of BD in indoor living environments, apart from slight increases in the amount of vegetation growing in home gardens/roofs/vases and in vistas from home, glass surfaces, and sunlight illumination. Nevertheless, the majority of respondents reported that they would like to include these BD elements in the future.

figure 2

Exposure to BD elements before and after the pandemic: balcony/terrace; private garden/common garden; green roof/façade; views and vistas from home, green or blue; plants/vegetation growing in home gardens/roofs/vases; glass surfaces, sunlight illumination (dynamic & diffuse light); orientation, ventilation, thermal and airflow variability; natural materials (natural wood grains; leather; stone, fossil textures; bamboo, rattan, dried grasses, cork, organic palette). The x-axis represents the BD elements; the y-axis represents the % of survey participants

Furthermore, the likelihood of BD outdoors when changing habitation is considered important, especially in relation to proximity to urban gardens/green areas (53.1%), proximity to a water body (sea, lake, river, etc.; 45.1%), proximity to rural areas/suburbs (natural terrain with trees and vegetation; 42.7%), proximity to city centers and services (39.1%), proximity to relatives or family (36.7%), and an elevated position (i.e., looking downhill or a viewpoint with a wide prospect; 22.7%). Other changes are related to working habits: 65% of respondents preferred flexible virtual/office presence time, 44% preferred fewer working hours, 21% preferred switching to a full-time home office, and only 8% preferred full-time office presence. One-quarter of the respondents had changed jobs/occupations after the lockdown. As a result, 63% of participants reported having adapted their home to create space for a home office and 39% for recreational activities.

Only a small percentage (11%) of respondents reported having created an indoor individual space for Prospect (an unimpeded view over a distance for surveillance and planning), Refuge (withdrawal from environmental conditions or the main flow of activity, in which the individual is protected from behind and overhead), Mystery (partially obscured views or other sensory devices that entice the individual to travel deeper into the environment) or Risk (an identifiable threat coupled with a reliable safeguard). Nevertheless, the majority of the respondents consider it very likely that they will include these elements in the future. Fifty-three percent of the respondents reported other changes, especially related to activities in nature, meditation and self-reflection, work, a shift in nutrition towards a more plant-based diet and taking a pet.

5.2 Statistical analysis

A first logistic model was estimated using all the available data (all countries) (see Table  3 ) to evaluate the effect of the variables on the high vs. low/moderate “Likelihood of including BD elements in the environment”. The variables with significant descriptive power fall into three clusters: age and sex; variables pertaining to the severity of the experience lived during lockdown (overcrowding, limited choice of food, and time spent outside during lockdown); and variables reflecting the need for interaction (pet). The most significant variable is undoubtedly the degree of “overcrowding” experienced during lockdown, which, like the time spent outside during lockdown, may act as a proxy for the experienced severity of the restrictions (as distinguished from the perceived severity and from the objective severity of the restrictions; see Table  4 ).
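For readers who wish to see what such a model looks like in code, the following is a minimal R sketch of a binomial logistic regression of a high (vs. low/moderate) likelihood of including BD on predictors of the kind listed above. The variable names, codings and randomly generated values are assumptions for illustration only, not the authors' data or script.

```r
# Stand-in data frame with the kind of variables described (random values, for illustration)
set.seed(1)
n <- 300
d <- data.frame(
  likelihood_BD         = sample(c("high", "low_or_moderate"), n, replace = TRUE),
  age                   = sample(c("1960s", "1970s", "1980s", "1990s"), n, replace = TRUE),
  sex                   = sample(c("female", "male", "other"), n, replace = TRUE, prob = c(.58, .41, .01)),
  overcrowding          = sample(c("high", "low"), n, replace = TRUE),
  food_choice_limited   = sample(c("yes", "no"), n, replace = TRUE),
  time_outside_lockdown = sample(c("less", "same", "more"), n, replace = TRUE),
  pet                   = sample(c("yes", "no"), n, replace = TRUE)
)

d$likelihood_high <- ifelse(d$likelihood_BD == "high", 1, 0)   # high vs. low/moderate outcome

m_all <- glm(likelihood_high ~ age + sex + overcrowding + food_choice_limited +
               time_outside_lockdown + pet,
             data = d, family = binomial)

summary(m_all)   # Wald tests for the individual predictors
```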

Based on these overall results, we conducted narrower correspondence analyses to examine the dependence of the “likelihood” on the variables identified above within the subsample of the six selected cluster countries. Figures 3 , 4 and 5 show the results for the variables “Age”, “Time spent outside during the pandemic” and “Overcrowding”.
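A correspondence analysis of this kind can be sketched with the FactoMineR package cited in the references; the toy contingency table below (birth decade by likelihood of including BD) is invented purely to illustrate the procedure behind factor maps such as Figs. 3, 4 and 5.

```r
library(FactoMineR)

# Toy contingency table (rows: birth decade, columns: likelihood of including BD)
tab <- as.data.frame(matrix(c(10, 15, 30,
                              12, 20, 25,
                               8, 10, 40),
                            nrow = 3, byrow = TRUE,
                            dimnames = list(c("1970s", "1980s", "1990s"),
                                            c("low", "moderate", "high"))))

res <- CA(tab, graph = FALSE)   # correspondence analysis of the cross-tabulation
plot(res)                       # factor map analogous to Figs. 3-5
```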

figure 3

Correspondence analysis factor map for the likelihood of using biophilic design in relation to the variable: age

figure 4

Correspondence analysis factor map for the likelihood of using biophilic design in relation to the variable: time spent outside during the pandemic

figure 5

Correspondence analysis factor map for the likelihood of using biophilic design in relation to the variable: overcrowding

Figure  3 indicates a moderate “Likelihood of including BD in the environment” for males, while females and individuals who chose the “other” option have more extreme points of view (either low or high likelihood). Figure  4 indicates that the “Likelihood of including BD elements in the environment” is high both for respondents who went outside “less” during the pandemic and for those who went outside “more”. The other dependencies are as follows: the more severe the effect of the lockdown, the higher the “Likelihood of including BD elements in the environment”, as shown in Fig.  5 for the dependence on overcrowding. An analysis of the variable “Choice of food” did not reveal relevant additional information.

Based on these insights, we ran further analyses on the subsample of the six countries, aggregated by country. We included objective variables, such as the severity of the restrictions and the country itself; however, neither variable showed a significant influence. After recursively discarding the nonsignificant factors, we obtained the final logistic model displayed in Table  5 .

Two factors appear to correlate the most with the “Likelihood of including BD in the environment” (as also found by the descriptive analysis): “Time spent outside during lockdown” and the “Will to change the environment”. In particular, (1) having spent “less” or “more” time outside during lockdown was positively correlated with the likelihood of including BD in the environment, and (2) the higher the willingness to change the environment after experiencing the COVID-19 lockdown restrictions, the higher the likelihood of including BD in the environment. Those who spent “less” time outside during lockdown had approximately two times greater chances of including BD in their environment in the future than those who spent the “same” amount of time outside. Those who spent “more” time outside during lockdown had approximately four times greater chances (than those who spent the “same” amount of time outside). Those for whom the “Will to change the environment” was “high” had approximately seven times greater chances of including BD in their environment in the future (than those for whom the “Will to change the environment” was “low”). Those for whom the “Will to change the environment” was “moderate” had approximately the same chances as those for whom it was “low” (Figs. 6 , 7 and 8 ).
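The sketch below illustrates, on randomly generated stand-in data, how odds ratios of this kind are read off a final logistic model once "same" and "low" are set as reference levels; the variable names and values are assumptions for illustration, not the authors' estimates.

```r
# Stand-in subsample with the two retained predictors (random values, for illustration)
set.seed(2)
n <- 250
d6 <- data.frame(
  likelihood_high = rbinom(n, 1, 0.5),
  time_outside    = factor(sample(c("less", "same", "more"), n, replace = TRUE)),
  will_to_change  = factor(sample(c("low", "moderate", "high"), n, replace = TRUE))
)

# Reference levels "same" and "low", so coefficients match the comparisons in the text
d6$time_outside   <- relevel(d6$time_outside, ref = "same")
d6$will_to_change <- relevel(d6$will_to_change, ref = "low")

m_final <- glm(likelihood_high ~ time_outside + will_to_change,
               data = d6, family = binomial)

exp(coef(m_final))             # odds ratios relative to the "same"/"low" baseline
exp(confint.default(m_final))  # Wald 95% confidence intervals on the odds-ratio scale
```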

figure 6

Causal structure diagram explaining the likelihood of including biophilic design in person-environment relation

figure 7

Lifestyle changes as reported by the survey participants by country. The y-axis represents the % of survey participants and the x-axis represents the reported lifestyle changes

figure 8

The relative role of lifestyle changes as reported by the participants

The strongest combination of factors appears to be “Will to change the environment”: “high” and “Time spent outside during lockdown”: “more”. For this combination, the probability of including BD in the environment in the future is estimated to be 91%. The weakest combination of factors appears to be “Will to change the environment”: “low” and “Time spent outside during lockdown”: “same”. For this combination, the probability of including BD in the environment in the future is estimated to be 24%. Finally, these two variables may act as proxies for the severity of the restrictions experienced (a subjective indicator of severity, as opposed to the objective severity of the lockdown measures indicated by the indices of the Oxford COVID-19 government response tracker).
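These two figures are predicted probabilities obtained through the logistic (inverse-logit) link. The sketch below shows the arithmetic with hypothetical coefficients chosen only to be roughly consistent with the reported odds ratios; they are not the authors' estimates.

```r
inv_logit <- function(x) 1 / (1 + exp(-x))   # logistic link: maps log-odds to probabilities

b0     <- -1.15   # baseline: "same" time outside and "low" will to change (about 24%)
b_more <-  1.40   # shift for "more" time outside; exp(1.40) is roughly a 4-fold odds increase
b_high <-  2.00   # shift for "high" will to change; exp(2.00) is roughly a 7-fold odds increase

inv_logit(b0)                    # weakest combination: about 0.24
inv_logit(b0 + b_more + b_high)  # strongest combination: about 0.90
```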

We conducted a further analysis to explain the motivation to include BD elements in the future environment by running a Bayesian causal analysis [ 88 , 89 ] using the R package bnlearn [ 90 ]. The dataset included the six cluster countries. A first Bayesian network representation of the joint distribution of all variables coded at two levels (high vs. low or moderate, more vs. less or same) was obtained and gradually modified to account for causal dependencies while not degrading the fitness measure retained (the BIC criterion). In the final model, two main causal factors explain the “Likelihood of including BD elements in the future environment”: the “Importance given to BD” and the “Importance of spending time outside”. Moreover, following the causal path, the lockdown had a causal impact on these two factors, mainly through the experience of “Overcrowding during lockdown”, which may be considered a good proxy of the perceived severity of restrictions. The obtained causal structure is summarized by the diagram shown in Fig.  6 .
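A score-based structure search of the kind described can be sketched with bnlearn [ 90 ] as follows; the variable names and the randomly generated two-level data are stand-ins for illustration only.

```r
library(bnlearn)

# Stand-in data: five two-level factors with random values (for illustration only)
set.seed(3)
vars <- c("overcrowding", "time_outside", "importance_BD",
          "importance_outside", "likelihood_BD")
d_bn <- data.frame(lapply(setNames(vars, vars), function(v)
  factor(sample(c("high", "low_or_moderate"), 200, replace = TRUE))))

net <- hc(d_bn, score = "bic")   # hill-climbing structure search scored by BIC
score(net, d_bn, type = "bic")   # the fitness measure retained in the analysis
plot(net)                        # directed graph of the learned dependencies
```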

As indicated, individuals who experienced strong overcrowding during lockdown are more likely to attach high importance to BD elements in their environment and hence are more likely to change their environment through BD elements and to value spending time outside. Interestingly, people who are less willing to change their environment give less importance to BD in general and are less likely to have experienced overcrowding during lockdown. Furthermore, people who experienced more overcrowding during lockdown may have had fewer opportunities to go outside, may have had problems accessing food and may have felt “stuck at home”, which explains the relevance of these variables in our logistic regressions. We confirmed this insight with the contingency tables shown in Tables 6 and 7 .
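As an illustration of such a check, the sketch below cross-tabulates overcrowding against time spent outside and applies a simple independence test; the counts are invented and do not come from the survey.

```r
# Invented counts for illustration; rows: overcrowding, columns: time spent outside
tab_oc <- as.table(matrix(c(35, 15,
                            20, 40),
                          nrow = 2, byrow = TRUE,
                          dimnames = list(overcrowding = c("high", "low"),
                                          time_outside = c("less", "more"))))

tab_oc               # contingency table in the spirit of Tables 6 and 7
chisq.test(tab_oc)   # quick check of independence between the two variables
```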

5.3 Qualitative analyses

Altogether and across countries, 32 comments concerned a healthier lifestyle, which amounted to 16.4% of the total number of comments (195). The qualitative analysis of a total of 210 comments from the Danish, German, Italian, Polish, Spanish and Swedish participants yielded 12 different thematic categories. Although participants were free to write as much as they wanted, most made exactly one point; only three comments were assigned to two different categories. First, one group of 20 comments seemed to refer to the time during the pandemic rather than afterwards. In the following, we disregard these comments, leaving us with 193 valid data points. In addition, 44 comments (mostly by the Swedish participants) suggested that there were no changes in their lifestyle due to the pandemic.

The next largest category comprises comments in which participants described performing more outdoor activities. For instance, participants reported longer walks, spending time in the garden and enjoying fresh air; for example, participants wrote “I spend more time in my city’s park”, that they “try to integrate exercise and fresh air” or that they “take time to enjoy nature.” Relatedly, eighteen participants reported having resumed a healthier lifestyle, eating less meat, performing more physical activity and cooking more at home. One participant wrote, “The COVID-19 pandemic and lockdown were starting points for changing what I don’t like in my life. The years after the lockdown were full of news in my life and radically changed myself.” An additional ten participants reported doing more sports, for instance, “starting fitness at home”, “doing more sports” or doing “more physical activities outdoors”. In addition, eleven participants reported paying more attention to their mental health. Many have taken up regular meditation exercises, while others report more awareness and appreciation of the small things in life; for instance, some simply take “more time for meditation”, another wrote, “The overwhelming anxiety forced me to become more meditative”, and others state that they now prioritize time for themselves and place greater value on time alone, outside or for recreational purposes.

Alongside this greater appreciation of life and nature, ten participants also reported a greater appreciation of social connections; one participant reported moving into the city because the countryside was lonely. More generally, participants described spending more time with friends and family; many stated that they go out more often and “spend more time with family”, organize “weekly potlucks with my friends to create a sense of community” or “think about friendship in a new way”.

Eighteen participants also reported changes in their homes; several had obtained a pet or more plants (e.g., “took a pet and changed home”), and some described creating a separate space for a home office, for instance, “I created a home office with a good amount of daylight and a view outside”. Some even bought a house or moved to the countryside; one participant even changed country. Eight participants reported improvements to their homes: “cleaning my room more often and having less things there, making it a productive space” or “general home and garden improvements”. Furthermore, participants reported that they improved the efficiency of their work or achieved a better work-life balance, for example, by means of “smart working” or the use of “new technologies that allow me to rest better”. Eight participants had found a new job or changed careers. Three reported new skills that they acquired during the pandemic. Finally, three participants stated that they are still careful and attend to social distancing; for instance, they say that they are still “careful”, “keep distance from people” and “pay more attention to hygiene in public spaces”. A family in Poland sold their apartment to buy a house in the suburbs, with a sizable plot of land, a vegetable garden and a small orchard, to obtain more living space and better air quality; they had never considered this before the pandemic.

Participants’ qualitative data thus align with the quantitative survey data, suggesting that the pandemic indeed had an effect on lifestyle choices: participants became more attentive to nature and now place greater value on nature and on outdoor activities in the sun and fresh air. For some, these new preferences have led to changes in habitation. Regarding the differences among the six countries under consideration, the Polish participants reported more career changes than the participants from other countries did and reported less about the effect of the pandemic on their outdoor activities, attention to mental health and physical activity; however, they seem to have used the pandemic widely to change their eating habits. Among the Swedish, Polish and Danish participants, many reported no changes; among the Swedish participants in particular, 38.1% stated that there were no changes at all. These results are in line with the severity of the restrictions imposed in these countries during the pandemic. Furthermore, extended social distancing effects were mentioned only by German and Swedish participants. In all other respects, participants from all six countries reported similar changes in lifestyle; this is interesting in its own right, since across countries people seem to have used the pandemic to rethink their lifestyles, and many of the reported changes are towards a more biophilic and sustainable lifestyle. However, it must be noted that these conclusions apply only to those participants who responded to the qualitative part of the questionnaire.

The qualitative analysis added further insights to the interpretation of our quantitative analysis because it indicated that whether people changed their lifestyles was related to the severity of the country’s lockdown measures, whereas how and what they changed depended to some extent on the respondents’ personal circumstances. There were also other changes that are not related to BD concepts, such as new jobs, increased efficiency, and new education.

6 Conclusions

The study background reinforces the idea that the COVID-19 pandemic restrictions and lockdowns were accompanied by general discomfort due to alterations in livelihood, work, activities in nature and in-person social interaction. Based on the findings of this study, it can be argued that the severity of the restrictive measures (strong/moderate/weak) imposed by countries during the global COVID-19 pandemic influenced the likelihood of including BD in changed people-environment relationships. However, this effect occurs mainly when restrictions are individually perceived or experienced as severe due to personal circumstances. Our findings suggest that pandemic restrictions triggered a motivation to include BD and thereby to change the person-environment relation. In this regard, the individually perceived severity of restrictions appears to be a stronger proxy of intended environmental behavior change than country-level objective indicators. These findings underline the relevance of individual environment perception within the systemic reasoning of mindsponge theory [ 22 ]. It can also be concluded that, in times of a perceived crisis of the person-environment relation, the restorative effects of BD proposed by SRT [] and ART [] may serve as an explanation of an individual's intention to change their lifestyle towards more natural surroundings. The mixed-methods analyses conducted at different sample levels revealed that, despite drastic social distancing measures, the experienced discomfort created by “Overcrowding” was the most influential variable in relation to the “Time spent outside in nature”, independent of other variables such as country, gender and age. Likewise, the comparative data evaluation between Italy, Spain, Germany, Poland, Sweden and Denmark does not reinforce the assumption that people-environment relations differ according to the severity of the measures in a country context, but rather according to individual responses to crises.

Experienced restrictions (mainly defined by “Overcrowding” and “Time spent outside in nature”) influence the likelihood of including BD elements in the future. Individuals whose person-environment relations changed through the pandemic restrictions, by having “more” or “less” access to nature, were more likely to change their behavior, as indicated by the use of BD elements both indoors and outdoors, compared with individuals whose access to nature remained the “same” (unaffected). However, “Overcrowding” here is also subject to individual perceptions (unrelated, for example, to any official definition of persons per square meter) and may have created more discomfort for individuals who had fewer opportunities to go outside during lockdown, had problems accessing food, were “stuck at home” or had “difficulties to work”, as also reflected in the role of these variables in the logistic regressions.

The COVID-19 pandemic appears to have influenced the trend toward relocation in proximity to urban gardens/green areas, water (sea, lake, river, etc.) or rural areas/suburbs, and not so much changes in specific indoor elements (as supported by the suburbanization statistics in the six countries). Often, the desired changes conflict with what is possible: the results indicate an appreciation of natural elements, but changing one’s environment is not always feasible and remains tied to the freedom to choose.

To conclude, there is a post-pandemic tendency toward a greater connection with nature and healthier habits regarding nutrition and lifestyle, within the freedom of individual choice. The likelihood of changing the environment through BD elements is related to the changes experienced and perceived during the pandemic. Individuals who experienced “Overcrowding” are more likely to place high importance on BD elements in their environment now and hence are more willing to change their person-environment relationships. Given these preconditions, respondents revealed a higher likelihood of including BD elements in the future, whereas a relatively unchanged routine during the pandemic did not result in post-pandemic changes or increased attention to BD elements, either indoors or outdoors.

7 Limitations and implications for theory and practice

This is a cross-sectional rather than longitudinal survey-based data collection study, owing to the impossibility of starting a survey during the pandemic. Thus, we relied on participants’ recollections, given that the survey was conducted in early 2023, when the pandemic had not yet officially ended. For the purposes of this study, the survey questions relating to the perceived severity of the restrictions are limited to discomfort due to overcrowding, difficulties working, access to green spaces, exposure to BD elements and choice of food, excluding other elements such as travel (local and international) and access to social events and contacts.

Regarding the lifestyle changes reported by the participants, it must be considered that all those who did not complete the open question in the survey may not have actually changed anything (the reason for not providing information on lifestyle changes is not known). Furthermore, when reporting on their lifestyle changes, the participants had just been primed to consider biophilic lifestyle changes, since they were asked about those changes in the preceding questions. The order in which the questions were asked may thus have introduced a certain bias in the reporting of biophilic lifestyle changes, meaning that these changes may have occurred but people may have been less inclined to report other kinds of changes.

Our study relies on data from a specific demographic group, particularly with respect to educational background, which may not be representative of the general population. The survey was distributed among participants from particular organizations and institutions, and 90% of the respondents had a higher level of education, which does not mirror the broader population. Therefore, our evaluations must be interpreted in light of the preferences of this specific demographic group. Furthermore, our conclusions consider the data of only six countries. While the country selection was specifically motivated by a classification of the objective severity of measures, we recognize that a different or extended set of countries might have led to different results and interpretations. Future research can extend the range and heterogeneity of the data across countries and further demographic variables to enhance the generalizability of the conclusions. At the same time, this may serve as an evaluation of our conclusion that it is the perceived severity (above all, overcrowding) rather than the objective severity of restrictions that has the stronger impact on the likelihood of including BD in future life. In this regard, a shift in the level of analysis towards regional or local surroundings may provide further insights into the relevance of (perceived) overcrowding for the likelihood of BD in person-environment relations. This will ultimately help extend the scope of this research approach beyond effects related to the COVID-19 pandemic.

The pandemic highlighted the value of nature in cities and in the living environment for the health and well-being of citizens, and it uncovered the unhealthy aspects of current urbanization and living-working styles. Our qualitative data underline the general finding that the pandemic has increased the trend of suburbanization, stressing that accessible urban nature is a key component of creating sustainable urban communities and of human health and well-being. However, we need to point out that the majority of respondents in our study wish or plan for changes rather than having already made them. As such, our results emphasize that proper societal structures and long-term measures are important for enabling larger-scale changes in people-environment relations. Future longitudinal studies will have to determine whether such measures are effective.

Data availability

The datasets generated and/or analysed during the current study are available from the corresponding author upon reasonable request.

References

Kellert SR. Introduction. In: Kellert SR, Wilson EO, editors. The biophilia hypothesis. Washington, DC: Island Press; 1993.


Ulrich R, Simons R, Losito B, Fiorito E, Miles M, Zelson M. Stress Recovery During Exposure to Natural and Urban Environments. J Environ Psychol. 1991;11:201–30. https://doi.org/10.1016/S0272-4944(05)80184-7 .


Li Q, Otsuka T, Kobayashi M, Wakayama Y, Inagaki H, Katsumata M, Kagawa T. Acute effects of walking in forest environments on cardiovascular and metabolic parameters. Eur J Appl Physiol. 2011;111:2845–53. https://doi.org/10.1007/s00421-011-1918-z .


Berto R. The role of nature in coping with psycho-physiological stress: a literature review on restorativeness. Behav Sci. 2014;4(4):394–409. https://doi.org/10.3390/bs4040394 .

Downton P, Jones D, Zeunert J, Roös P. Biophilic design applications: putting theory and patterns into built environment practice. Knowledge E. 2017. https://doi.org/10.18502/keg.v2i2.596 .

Lee KO, Mai KM, Park S. Green space accessibility helps buffer declined mental health during the COVID-19 pandemic: evidence from big data in the United Kingdom. Nat Mental Health. 2023;1:124–34. https://doi.org/10.1038/s44220-023-00018-y .

Wang J, Fan Y, Palacios J, et al. Global evidence of expressed sentiment alterations during the COVID-19 pandemic. Nat Hum Behav. 2022;6:349–58. https://doi.org/10.1038/s41562-022-01312-y .

Strand R, Kovacic Z, Funtowicz S, Benini L, Jesus A. COVID-19: lessons for sustainability? European Environment Agency (EEA); 2022.

Sneddon J, Daniel E, Fischer R, et al. The impact of the COVID-19 pandemic on environmental values. Sustain Sci. 2022;17:2155–63. https://doi.org/10.1007/s11625-022-01151-w .

Nicolas M, Martinent G, Palinkas L, Suedfeld P. Dynamics of stress and recovery and relationships with perceived environmental mastery in extreme environments. J Environ Psychol. 2022;83:101853. https://doi.org/10.1016/j.jenvp.2022.101853 .

Yabe T, Bueno BGB, Dong X, Pentland A, Moro E. Behavioral changes during the COVID-19 pandemic decreased income diversity of urban encounters. Nat Commun. 2023;14(1):2310. https://doi.org/10.1038/s41467-023-37913-y .

Rainisio N, Inghilleri P. Culture, environmental psychology, and well-being: An emergent theoretical framework. In: Rainisio N, editor. Well-being and cultures: Perspectives from positive psychology. Dordrecht: Springer, Netherlands; 2012. p. 103–16. https://doi.org/10.1007/978-94-007-4611-4_7 .


Kalantidou E. Not going back to normal: designing psychologies toward environmental and social resilience. Hu Arenas. 2023;6:131–46. https://doi.org/10.1007/s42087-021-00198-y .

Akenji L. Consumer scapegoatism and limits to green consumerism. J Clean Prod. 2014;63:13–23. https://doi.org/10.1016/j.jclepro.2013.05.022 .

Mihelcic JR, Zimmerman JB. Environmental engineering: fundamentals, sustainability, design. Hoboken: John Wiley & Sons; 2014.

Wijesooriya N, Brambilla A. Bridging biophilic design and environmentally sustainable design: a critical review. J Cleaner Prod. 2021;283: 124591. https://doi.org/10.1016/j.jclepro.2020.124591 .

Barbiero G, Berto R. Biophilia as evolutionary adaptation: an onto-and phylogenetic framework for biophilic design. Front Psychol. 2021;12: 700709. https://doi.org/10.3389/fpsyg.2021.700709 .

Wilson EO. Biophilia. Cambridge: Harvard University Press; 1984.


Kellert SR, Wilson EO. The biophilia hypothesis. Washington, DC: Island Press; 1993.

Kellert SR, Calabrese EF. 2015. The practice of biophilic design. http://www.biophilic-design.com

Zhong W, Schröder T, Bekkering J. Biophilic design in architecture and its contributions to health, well-being, and sustainability: a critical review. Front Architect Res. 2022;11(1):114–41. https://doi.org/10.1016/j.foar.2021.07.006 .

Romm J, Browning WD. Greening the building and the bottom line. Terrapin Bright Green; 1994. https://terrapinbrightgreen.com

Terrapin Bright Green. Patterns of biophilic design. https://terrapinbrightgreen.com

Söderlund J, Newman P. Improving Mental Health in Prisons Through Biophilic Design. The Prison Journal. 2017;97(6):750–72. https://doi.org/10.1177/0032885517734516 .

Hady SIMA. Activating biophilic design patterns as a sustainable landscape approach. J Eng Appl Sci. 2021;68:46. https://doi.org/10.1186/s44147-021-00031-x .

Roös P, Downton P, Zeunert J. Biophilia in urban design: patterns and principles for smart Australian cities; 2016.

Ulrich RS, Simons RF, Losito BD, Fiorito E, Miles MA, Zelson M. Stress recovery during exposure to natural and urban environments. J Environ Psychol. 1991;11(3):201–30. https://doi.org/10.1016/S0272-4944(05)80184-7 .

Gaekwad JS, Moslehian AS, Roös PB. A meta-analysis of physiological stress responses to natural environments: biophilia and stress recovery theory perspectives. J Environ Psychol. 2023. https://doi.org/10.1016/j.jenvp.2023.102085 .

Kaplan S. The restorative benefits of nature: toward an integrative framework. J Environ Psychol. 1995;15:169–82. https://doi.org/10.1016/0272-4944(95)90001-2 .

Stevenson MP, Schilhab T, Bentsen P. Attention restoration theory II: a systematic review to clarify attention processes affected by exposure to natural environments. J Toxicol Environ Health Part B. 2018;21(4):227–68. https://doi.org/10.1080/10937404.2016.1196155 .

Browning WD, Ryan CO, Clancy JO. 14 patterns of biophilic design. New York: Terrapin Bright Green, LLC; 2014. https://terrapinbrightgreen.com

Vuong Q-H. Mindsponge theory. De Gruyter. 2023. https://doi.org/10.2478/9788367405157-002 .

Nguyen M-H, La V-P, Le T-T, Vuong Q-H. Introduction to Bayesian mindsponge framework analytics: an innovative method for social and psychological research. MethodsX. 2022;9: 101808. https://doi.org/10.1016/j.mex.2022.101808 .

Montemayor C, Haladjian HH. Perception and cognition are largely independent, but still affect each other in systematic ways: arguments from evolution and the consciousness-attention dissociation. Front Psychol. 2017;8:40. https://doi.org/10.3389/fpsyg.2017.00040 .

Ren X. Pandemic and lockdown: a territorial approach to COVID-19 in China, Italy and the United States. Eur Geogr Econ. 2020;61(4–5):423–34. https://doi.org/10.1080/15387216.2020.1762103 .

Mækelæ M, Reggev N, Dutra N, Tamayo R, et al. Perceived efficacy of COVID-19 restrictions, reactions and their impact on mental health during the early phase of the outbreak in six countries. Royal Soc Open Sci. 2020;7: 200644. https://doi.org/10.1098/rsos.200644 .

Fors Connolly F, Olofsson J, Malmberg G, Stattin M. SHARE Working Paper Series 62–2021: adjustment of daily activities to restrictions and reported spread of the COVID-19 pandemic across Europe. 2021; https://doi.org/10.17617/2.3292885

Vindegaard N, Benros ME. COVID-19 pandemic and mental health consequences: systematic review of the current evidence. Brain Behav Immun. 2020;89:531–42. https://doi.org/10.1016/j.bbi.2020.05.048 .

Cumbrera MG, Foley R, Correa-Fernández J, González-Marín A, Braçe O, Hewlett D. The importance for wellbeing of having views of nature from and in the home during the COVID-19 pandemic. Results from the GreenCOVID study. J Environ Psychol. 2022;83:101864. https://doi.org/10.1016/j.jenvp.2022.101864 .

Caro JC, Clark AE, D’Ambrosio C, Vögele C. The impact of COVID-19 lockdown stringency on loneliness in five European countries. Soc Sci Med. 2022;314:115492. https://doi.org/10.1016/j.socscimed.2022.115492 .

D’Alessandro D, Gola M, Appolloni L, Dettori M, Fara GM, Rebecchi A, Settimo G, Capolongo S. COVID-19 and living space challenge Well-being and public health recommendations for a healthy, safe, and sustainable housing. Acta Biomed. 2020. https://doi.org/10.23750/abm.v91i9-S.10115 .

Fornara F, Mosca O, Bosco A, et al. Space at home and psychological distress during the Covid-19 lockdown in Italy. J Environ Psychol. 2022;79(2022):101747. https://doi.org/10.1016/j.jenvp.2021.101747 .

Atkinson-Clement C, Pigalle E. What can we learn from Covid-19 pandemic’s impact on human behaviour? The case of France’s lockdown. Hum Soc Sci Commun. 2021;8:81. https://doi.org/10.1057/s41599-021-00749-2 .

Bertoni M, Cavapozzi D, Pasini G, Pavese C. Remote working and mental health during the first wave of the COVID-19 pandemic. SSRN Electron J. 2022. https://doi.org/10.2139/ssrn.4111999 .

Awada M, Lucas G, Becerik-Gerber B, Roll S. Working from home during the COVID-19 pandemic: Impact on office worker productivity and work experience. Work. 2021. https://doi.org/10.3233/WOR-210301 .

Shulla K, Voigt BF, Cibian S, et al. Effects of COVID-19 on the sustainable development goals (SDGs). Dis Sustain. 2021;2:15. https://doi.org/10.1007/s43621-021-00026-x .

Afacan Y. Impacts of biophilic design on the development of gerotranscendence and the profile of mood states during the COVID-19 pandemic. Ageing Soc. 2021;43:1–25. https://doi.org/10.1017/S0144686X21001860 .

Gür M, Kaprol T. The participation of biophilic design in the design of the post-pandemic living space. Hershey: IGI Global; 2022. https://doi.org/10.4018/978-1-7998-6725-8.ch004 .

Econyl. Biophilic design: the future of greener living spaces.

Archiplex Group. Role of biophilic design in sustainable architecture. archiplexgroup.com

Bolisani E, Scarso E, Ipsen C, Kirchner K, Hansen J. Working from home during COVID-19 pandemic: lessons learned and issues. Manag Market. 2020;15(1):458–76. https://doi.org/10.2478/mmcks-2020-0027 .

Deepa V, Baber H, Shukla B, Sujatha R, Khan D. Does lack of social interaction act as a barrier to effectiveness in work from home? COVID-19 and gender. J Organ Effective. 2023;10(1):94–111. https://doi.org/10.1108/JOEPP-11-2021-0311 .

Holmes EA, O’Connor RC, Perry VH, Tracey I, Wessely S, Arseneault L, Bullmore E. Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science. Lancet Psychiatry. 2020;7(6):547–60. https://doi.org/10.1016/S2215-0366(20)30168-1 .

Hale T, Angrist N, Goldszmidt R, et al. A global panel database of pandemic policies (Oxford COVID-19 government response tracker). Nat Hum Behav. 2021;5:529–38. https://doi.org/10.1038/s41562-021-01079-8 .

Hale T. Working paper, version 15.0. University of Oxford; 2023.

SHARE-ERIC 2021. Results of the 1st SHARE Corona Survey; Project SHARE-COVID19 (Project Number 101015924, Report No. 1, March 2021). Munich: SHARE-ERIC. https://doi.org/10.17617/2.3356927

Protezionecivile. Coronavirus: state of emergency ends on March 31. https://www.protezionecivile.gov.it/en/notizia/coronavirus-state-emergency-ends-march-31/

Ipsen C, van Veldhoven M, Kirchner K, Hansen JP. Six key advantages and disadvantages of working from home in Europe during COVID-19. Int J Environ Res Public Health. 2021;18(4):1826. https://doi.org/10.3390/ijerph18041826 .

Kercuku A, Curci F, Lanzani A, Zanfi F. Italia di mezzo: The emerging marginality of intermediate territories between metropolises and inner areas. REGION. 2023;10:89–112. https://doi.org/10.18335/region.v10i1.397 .

Osservatori. Smart working in Italia: Numeri e trend. https://www.osservatori.net/it/ricerche/comunicati-stampa/smart-working-italia-numeri-trend

Larcher F, Pomatto E, Battisti L, Gullino P, Devecchi M. Perceptions of urban green areas during the social distancing period for COVID-19 containment in Italy. Horticulturae. 2021. https://doi.org/10.3390/horticulturae7030055 .

López MG, Chiner-Oms Á, García de Viedma D, et al. The first wave of the COVID-19 epidemic in Spain was associated with early introductions and fast spread of a dominating genetic variant. Nat Genet. 2021;53:1405–14. https://doi.org/10.1038/s41588-021-00936-6 .

Guirao A. The Covid-19 outbreak in Spain. A simple dynamics model, some lessons, and a theoretical framework for control response. Infect Dis Model. 2020;5:652–69. https://doi.org/10.1016/j.idm.2020.08.010 .

Monge S, Latasa Zamalloa P, Sierra Moros MJ, et al. Lifting COVID-19 mitigation measures in Spain (may–june 2020). Enferm Infecc Microbiol Clin. 2023;41(1):11–7. https://doi.org/10.1016/j.eimce.2021.05.019 .

Viguria AU, Casamitjana N. Early interventions and impact of COVID-19 in Spain. Int J Environ Res Public Health. 2021. https://doi.org/10.3390/ijerph18084026 .

Instituto de la Mujer. Impacto de género del COVID-19. 2020. https://www.inmujeres.gob.es/

Fitbit. The impact of coronavirus on global activity. https://blog.fitbit.com/covid-19-global-activity . Accessed 22 Mar 2020.

Stawarz N, Rosenbaum-Feldbrügge M, Sander N, Sulak H, Knobloch V. The impact of the COVID-19 pandemic on internal migration in Germany: a descriptive analysis. Popul Space Place. 2022. https://doi.org/10.1002/psp.2566 .

Säumel I, Sanft SJ. Crisis mediated new discoveries, claims and encounters: changing use and perception of residential greenery in multistory housing in Berlin, Germany. Urban For Urban Green. 2022. https://doi.org/10.1016/j.ufug.2022.127622 .

Gruszczynski L, Zatoński M, Mckee M. Do regulations matter in fighting the COVID-19 pandemic? Lessons from Poland. European Journal of Risk Regulation. 2021;12(4):739–57. https://doi.org/10.1017/err.2021.53 .

The New York Times. House hunting in Poland: why more buyers are looking there. https://www.nytimes.com/2020/08/26/realestate/house-hunting-poland.html

Statens Serum Institut. COVID-19 ramte verden og Danmark: Se tidslinjen her. https://www.ssi.dk/aktuelt/nyheder/2022/da-covid-19-ramte-verden-og-danmark-se-tidslinjen-her

Statistics Denmark. Housing conditions. https://www.dst.dk/en/Statistik/emner/borgere/boligforhold

Ludvigsson JF. The first eight months of Sweden’s COVID-19 strategy and the key actions and actors that were involved. Acta Paediatr. 2020;109(12):2459–71. https://doi.org/10.1111/apa.15582 .

Carlson J, Tegnell A. Swedish response to COVID-19. China CDC Wkly. 2020. https://doi.org/10.46234/ccdcw2020.215 .

Björkman A, Gisslén M, Gullberg M, Ludvigsson J. The Swedish COVID-19 approach: a scientific dialogue on mitigation policies. Front Public Health. 2023. https://doi.org/10.3389/fpubh.2023.1206732 .

Vilhelmson, et al. Sustained work from home post-pandemic? A Swedish case. Findings. 2023. https://doi.org/10.32866/001c.74470 .

Hallman DM, Januario LB, Mathiassen SE, et al. Working from home during the COVID-19 outbreak in Sweden: effects on 24-h time-use in office workers. BMC Public Health. 2021;21:528. https://doi.org/10.1186/s12889-021-10582-6 .

Politico. Europe’s coronavirus lockdown measures compared. https://www.politico.eu/article/europes-coronavirus-lockdown-measures-compared/

Kellert SR. Dimensions, elements, and attributes of biophilic design. Biophilic Design. 2008;2015:3–19.

Beatley T. Biophilic cities: integrating nature into urban design and planning. Washington, DC: Island Press; 2010.

Beatley T. Handbook of biophilic design. Washington, DC: Island Press; 2016.

Braun V, Clarke V. Using thematic analysis in psychology. Qual Res Psychol. 2006;3(2):77–101.

Lucero A. Using affinity diagrams to evaluate interactive prototypes. In: Abascal J, Barbosa S, Fetter M, Gross T, Palanque P, Winckler M, editors. Human-computer interaction—INTERACT 2015. INTERACT 2015. Lecture notes in computer science, vol. 9297. Cham: Springer; 2015. https://doi.org/10.1007/978-3-319-22668-2_19 .

Lê S, Josse J, Husson F. FactoMineR: an r package for multivariate analysis. J Stat Softw. 2008;25(1):1–18. https://doi.org/10.18637/jss.v025.i015 .

Venables WN, Ripley BD. Modern applied statistics with S. New York: Springer; 2002.

Selvin S. Statistical analysis of epidemiologic data (monographs in epidemiology and biostatistics, V. 35). 3rd ed. Oxford: Oxford University Press; 2004.

Greenacre M. Theory and application of correspondence analysis. London: Academic Press; 1983.

Greenacre M. Correspondence Analysis in Practice. 2nd ed. Boca Raton: Chapman and Hall/CRC; 2007.

R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2023.


Acknowledgements

The study is the third part of the project “Publication series: Sustainability in post pandemic society”, funded by the International Alumni Centre Berlin (iac), a center of excellence funded by the Robert Bosch Stiftung for impact-oriented alumni work and networks in philanthropy.

Open Access funding enabled and organized by Projekt DEAL.

Author information

Authors and Affiliations

University of Bonn, Bonn, Germany

Kalterina Shulla

South Westphalia University of Applied Sciences, Meschede, Germany

Bernd-Friedrich Voigt

LMBA UMR CNRS 6205, University of South Brittany, Lorient, France

Salim Lardjane

University of Southern Denmark, Odense, Denmark

Kerstin Fischer

Otwock, Poland

Piotr Kędzierski

QG Enviro, Lecce, Italy

Giuseppe Scandone

University of Applied Sciences and Arts, Bielefeld, Germany

Thomas Süße


Contributions

KSH and BFV contributed to the conceptualization, design, writing, revision and methodological framework; SL contributed to the statistical analyses; KF contributed to the qualitative analyses and country background; and PL, GS and TS contributed to the corresponding countries’ background and the introduction. All the authors contributed to the data collection.

Corresponding authors

Correspondence to Kalterina Shulla or Bernd-Friedrich Voigt .

Ethics declarations

Ethics approval and consent to participate.

Ethics approval and consent to participate were obtained from all the participants.

Consent for publication

The research does not involve human research participants. The survey was anonymous.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1.1 Survey: changes in people-environment relations in a post-pandemic society

Section  : General and background questions

1. Country

2. Decade of birth (1960, 1970, 1980, 1990, 2000)

3. Gender

4. Occupation/level of education

5. In your opinion, the COVID-19 lockdown restrictions in your country/city were relatively (weak, moderate, strong)

6. During the lockdown, did you experience discomfort related to (overcrowding/limited home space, no access to parks or natural areas, recreation, limited choice of food, difficulties to work)?

7. On a scale from 1 (not at all) to 7 (very much), did you have problems accessing parks or green spaces?

8. On a scale from 1 (not at all) to 7 (very much), did you have limited choice of foods?

9. On a scale from 1 (not at all) to 7 (very much), did you experience difficulties to work?

10. Are you a member of the Bosch Alumni Network? (Yes/No)

Section  : Exposure to biophilic design (design that reconnects us with nature and helps healing, reduces stress, improves creativity and our well-being) elements, indoors and outdoors, before and after the lockdown of the COVID-19 pandemic (color, water, air, sunlight, plants, animals, views and vistas, geology and landscape, habitats and ecosystems)

11. The time you spent outside during the lockdown, compared to before that (walking, sport activities; visual (a view to elements of nature living systems and natural processes)/nonvisual (auditory or gustatory stimuli that engender a deliberate and positive reference to nature, living systems or natural processes) connection with nature; in the presence of water (a condition that enhances the experience of a place through seeing, hearing or touching water); and sounds (nonrhythmic sensory stimuli) is? (Less/the same/more)

12. The time you spend outside now compared to before the lockdown (walking, sport activities, visual/nonvisual connection with nature, presence of water, sounds) is? (Less/the same/more)

13. On a scale from 1 to 7, how important are these elements for you?

14. Which of the elements of biophilic design were present in your indoor living environment before the lockdown? balcony/terrace; private garden/common garden; green roof/façade; views and vistas from home, green or blue; plants/vegetation growing in home gardens/roofs/vases; glass surfaces, sunlight illumination (dynamic & diffuse light and varying intensities of light and shadow that change over time to create conditions that occur in nature); orientation, ventilation, thermal and airflow variability (subtle changes in air temperature, relative humidity, airflow across the skin, and surface temperatures that mimic natural environments); any of these natural materials (natural wood grains; leather; stone, fossil textures; bamboo, rattan, dried grasses, cork, organic palette)

15. Which of the elements of biophilic design are present in your living environment now? balcony/terrace/private garden/common garden/green roof/façade/views and vistas from home, green or blue/plants/vegetation growing in home gardens/roofs/vases/glass surfaces, sunlight illumination/Orientation, ventilation, thermal and airflow variability/any of these natural materials (natural wood grains; leather; stone, fossil textures; bamboo, rattan, dried grasses, cork, organic palette)

16. On a scale from 1 to 7 (1- the lowest- 7 the highest), how likely is it that you will include the biophilic design elements in the future in your living environment (if possible)?

Section  : Flexibility and adaptation of the living environment after lockdown in relation to the elements of "Biophilic Design" and "Biophilic Urban Design"

17. Did you adapt your home environment after the lockdown to create space for: Home office/Recreational activities/Individual space for (biophilic elements of Prospect (An unimpeded view over a distance, for surveillance and planning) Refuge (A place for withdrawal from environmental conditions or the main flow of activity, in which the individual is protected from behind and overhead)], Mystery (The promise of more information, achieved through partially obscured views or other sensory devices that entice the individual to travel deeper into the environment) and Risk (An identifiable threat coupled with a reliable safeguard)

18. On a scale from 1 to 7 (1- the lowest- 7 the highest), how likely is that you will change your environment (home office, recreational/physical activities, individual space) in the future?

19. If you changed/would change your habitation after the lockdown, did/would you consider any of the elements of Biophilic Urban Design? Proximity to urban gardens, green areas/proximity to water bodies (sea, lake, river, etc.)/proximity to city centres and services/proximity to rural areas/suburbs (natural terrain with trees and vegetation)/elevated position (i.e., looking downhill or a viewpoint with a wide prospect)/proximity to relatives or family/other (please specify)

20. On a scale from 1 (less) to 7 (very), how important are the elements of Biophilic Urban Design for you when choosing your habitation?

21. After the lockdown, would you consider any of these changes in your working conditions: flexible virtual/office presence time/switch to full-time home office/total office presence/change in job/occupation/working fewer hours/other?

22. After the lockdown, would you consider any of these changes related to food: Organic/Regional/Growing your own vegetation through urban gardening or farming?

Section  : Additional information

23. After the lockdown, did you/would you consider taking a pet/animal?

24. Please describe any major changes that you have made incentivized by the lockdown restrictions concerning your lifestyle and wellbeing, based on the above or other factors

2.1 Three categories of the 14 biophilic patterns according to Browning, Ryan and Clancy [ 67 ]

Nature in the Space Patterns

1. Visual Connection with Nature (A view to elements of nature, living systems and natural processes)

2. Non-Visual Connection with Nature (auditory, haptic, olfactory, or gustatory stimuli that engender a deliberate and positive reference to nature, living systems or natural processes)

3. Non-Rhythmic Sensory Stimuli (Stochastic and ephemeral connections with nature that may be analysed statistically but may not be predicted precisely)

4. Thermal & Airflow Variability (Subtle changes in air temperature, relative humidity, airflow across the skin, and surface temperatures that mimic natural environments)

5. Presence of Water (A condition that enhances the experience of a place through seeing, hearing or touching water)

6. Dynamic & Diffuse Light (Leverages varying intensities of light and shadow that change over time to create conditions that occur in nature)

7. Connection with Natural Systems (Awareness of natural processes, especially seasonal and temporal changes characteristic of a healthy ecosystem)

Natural Analogues Patterns

8. Biomorphic Forms & Patterns (Symbolic references to contoured, patterned, textured or numerical arrangements that persist in nature)

9. Material Connection with Nature (Materials and elements from nature that, through minimal processing, reflect the local ecology or geology and create a distinct sense of place)

10. Complexity & Order (Rich sensory information that adheres to a spatial hierarchy similar to those encountered in nature)

Nature of the Space Patterns

11. Prospect (An unimpeded view over a distance for surveillance and planning)

12. Refuge (A place for withdrawal from environmental conditions or the main flow of activity in which the individual is protected from behind and overhead)

13. Mystery (The promise of more information, achieved through partially obscured views or other sensory devices that entice the individual to travel deeper into the environment)

14. Risk/Peril (An identifiable threat coupled with a reliable safeguard)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Shulla, K., Voigt, BF., Lardjane, S. et al. People-environment relations following COVID-19 pandemic lifestyle restrictions: a multinational, explorative analysis of intended biophilic design changes. Discov Sustain 5 , 229 (2024). https://doi.org/10.1007/s43621-024-00423-y


Received : 16 May 2024

Accepted : 16 August 2024

Published : 02 September 2024

DOI : https://doi.org/10.1007/s43621-024-00423-y


Keywords

  • Biophilic design
  • Environmental psychology
  • People-environment relation
  • COVID-19 pandemic



  19. Sage Research Methods

    A practical 'cut to the chase' handbook that quickly explains the when, where, and how of statistical data analysis as it is used for real-world decision-making in a wide variety of disciplines. In this one-stop reference, the authors provide succinct guidelines for performing an analysis, avoiding pitfalls, interpreting results and ...

  20. Sage Research Methods

    • Real research on specific antisocial behaviors provides consistent context for answering research questions in an interesting and intuitive way. • Discussion of statistical inference in an easy-to-understand manner ensures that students have the foundation they need to avoid misusing hypothesis tests. ... Principles & methods of ...

  21. The Beginner's Guide to Statistical Analysis

    Statistical analysis means investigating trends, patterns, and relationships using quantitative data. It is an important research tool used by scientists, governments, businesses, and other organisations. ... Your choice of statistical test depends on your research questions, research design, sampling method, and data characteristics ...

  22. Best Statistical Analysis Courses Online with Certificates [2024

    In summary, here are 10 of our most popular statistical analysis courses. Introduction to Statistics: Stanford University. Statistical Analysis with R for Public Health: Imperial College London. Data Analysis with R: Duke University. Business Statistics and Analysis: Rice University. Data Analysis with Python: IBM.

  23. A Robust Statistical Analysis of Factors Affecting ...

    The design analysis method by edges revealed that the following factors contributed to an increase in kidney failure cases: h8 (Recurrent urinary tract infection), h6 (Lack of exercise), h7 ...

  24. What is statistical analysis?

    A research hypothesis is your proposed answer to your research question. The research hypothesis usually includes an explanation ("x affects y because …"). A statistical hypothesis, on the other hand, is a mathematical statement about a population parameter. Statistical hypotheses always come in pairs: the null and alternative hypotheses.

  25. Robust identification of perturbed cell types in single-cell ...

    As single-cell datasets continue to expand, our faster and statistically rigorous method offers a robust and versatile tool for a wide range of research and clinical applications, enabling the ...

  26. People-environment relations following COVID-19 pandemic ...

    The study analyzes the consequences of the COVID-19 pandemic restrictions for the human-environment relations through the lenses of biophilic design. The mixed-method quantitative and qualitative explanatory research combines contextual and personal variables, such as, among others, country, age group, gender, overcrowding, time spent outside, access to nature/food and the exposure to ...