Why 96.4% of Psychological Safety Assessments Miss the Point

Your organization probably measures psychological safety. Most do now—it's become standard practice since Google's Project Aristotle made it a management buzzword a decade ago.

Here's the problem: you're almost certainly measuring it wrong.

This isn't a minor technical issue. It's a fundamental category error that renders most psychological safety data meaningless for the decisions organizations actually need to make.

After analyzing 667 studies across organizational psychology, a striking pattern emerged: despite 214 thematic mentions of psychological safety, only 6 of the 168 studies that measured outcomes directly (3.6%) actually measured the construct correctly. The rest measured something. Just not the thing that predicts team performance.

Understanding why this matters requires going back to what psychological safety actually is, and why that definition creates measurement requirements most organizations ignore.


What Psychological Safety Actually Is

In 1999, Harvard researcher Amy Edmondson defined psychological safety as "a shared belief held by members of a team that the team is safe for interpersonal risk-taking."

Three words in that definition create the measurement problem: "shared belief" and "team."

Psychological safety isn't an individual attitude. It's not how safe you feel. It's how safe we collectively perceive our team to be. This distinction sounds academic. It's actually the whole ballgame.

When Edmondson studied 51 work teams, she didn't just ask individuals whether they felt safe. She calculated whether team members' perceptions converged—whether people on the same team gave similar answers, and whether those answers differed from other teams.

The statistical test for this is called an Intraclass Correlation Coefficient (ICC). Edmondson's ICC was .39—meaning 39% of the variation in individual responses was attributable to team membership. People on the same team perceived similar levels of safety. People on different teams perceived different levels.

This validated the construct. Psychological safety existed at the team level, not just in individual heads.
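
To make the mechanics concrete, here's a minimal sketch of an ICC(1) calculation in Python, built from one-way ANOVA mean squares. The column names and toy data are hypothetical; this illustrates the statistic, not Edmondson's original analysis.

    # Minimal ICC(1) sketch: the share of response variance attributable to
    # team membership, computed from one-way ANOVA mean squares.
    import pandas as pd

    def icc1(df: pd.DataFrame, team_col: str = "team", score_col: str = "ps_score") -> float:
        groups = df.groupby(team_col)[score_col]
        grand_mean = df[score_col].mean()
        n, k = len(df), groups.ngroups
        sizes = groups.size()

        # Between-team vs. within-team mean squares
        ms_between = (sizes * (groups.mean() - grand_mean) ** 2).sum() / (k - 1)
        ms_within = ((df[score_col] - groups.transform("mean")) ** 2).sum() / (n - k)

        avg_size = sizes.mean()  # simple approximation when team sizes differ
        return (ms_between - ms_within) / (ms_between + (avg_size - 1) * ms_within)

    # Toy data: members of the same team give similar answers; teams differ
    df = pd.DataFrame({
        "team":     ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
        "ps_score": [6, 6, 5, 6, 2, 3, 2, 2, 4, 4, 5, 4],
    })
    print(f"ICC(1) = {icc1(df):.2f}")  # high here: teams explain most variance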

.39: Edmondson's ICC for psychological safety, indicating strong team-level agreement and validating that psychological safety is a shared team property, not an individual trait

What Most Organizations Do Instead

Now consider what most organizations actually do when they "measure psychological safety."

They administer a survey. Employees respond individually. Someone averages the responses—either across the whole organization or by department. Leadership receives a report: "Our psychological safety score is 3.8 out of 5."

This approach commits a fundamental measurement error. It treats a team-level construct as an individual-level attitude. It's like measuring "team coordination" by averaging individuals' hand-eye coordination scores.

The problems cascade:

Problem 1: Organization-Wide Averages Hide the Distribution

An organization-wide average of 3.8 could mean universal mediocrity, or it could mask excellent teams hiding alongside toxic ones. Without team-level data, you can't tell the difference.
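
A toy illustration, with invented numbers, of why the average alone can't distinguish the two cases:

    import statistics

    # Five team-level scores per organization; both average 3.8 out of 5
    uniform_org   = [3.8, 3.8, 3.8, 3.8, 3.8]  # universal mediocrity
    polarized_org = [4.9, 4.7, 3.8, 2.8, 2.8]  # excellent teams next to struggling ones

    print(statistics.mean(uniform_org))    # 3.8
    print(statistics.mean(polarized_org))  # 3.8: same number, opposite realities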

Problem 2: No Validation That Teams Actually Share Perceptions

If team members disagree wildly about psychological safety—one person rates it 6, another rates it 2—averaging those scores produces a meaningless number. True psychological safety requires convergence. If people on the same team don't share perceptions, you're not measuring psychological safety. You're measuring something else: perhaps individual personality differences, or personal relationships with managers.

Proper measurement calculates ICC to verify convergence. Without this step, you can't know whether your data represents actual team-level psychological safety or just noise.

Problem 3: Improvements Get Diluted in Aggregate Data

Imagine you run an intervention. Three teams improve dramatically. Seven teams show no change. Your organization-wide average barely moves.

Leadership concludes: "The intervention didn't work."

Actually, the intervention worked brilliantly in three teams. You just can't see it because you're measuring at the wrong level. The information that would tell you what worked and why is invisible in aggregate data.
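
Here's the same dilution effect as a quick sketch, again with invented numbers:

    # Hypothetical intervention: three teams improve a full point, seven don't move
    pre  = [3.5] * 10
    post = [4.5, 4.5, 4.5] + [3.5] * 7

    org_change = sum(post) / len(post) - sum(pre) / len(pre)
    print(f"Org-wide change: +{org_change:.2f}")  # +0.30, looks like a dud

    team_changes = [after - before for before, after in zip(pre, post)]
    print(team_changes)  # [1.0, 1.0, 1.0, 0.0, ...]: three clear successes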

Problem 4: You Can't Target Interventions Accurately

Without team-level measurement, you can't identify which teams need attention. You can identify departments or locations with lower averages, but within those units, the actual problem teams remain hidden among higher-performing teams that pull up the average.

Resources get spread across entire departments when they should be concentrated on specific teams with specific issues.


The Research Evidence

This isn't theoretical criticism. The measurement gap shows up clearly in research.

Our systematic analysis of 667 studies across five research matrices found a striking pattern: psychological safety was frequently mentioned but rarely measured correctly.

214: thematic mentions of psychological safety across the research corpus, indicating high theoretical interest

Yet when we examined how studies actually operationalized psychological safety:

Only 6 of 168 extraction studies (3.6%) measured psychological safety as an outcome with appropriate team-level validation. Meanwhile, 163 studies measured turnover intentions—a distal outcome that's easier to capture but tells you less about team dynamics.

Why the gap? Four factors explain it:

Statistical complexity. Proper team-level measurement requires ICC calculation, minimum team sizes, and validation that aggregation is justified. Most researchers—and virtually all organizational surveys—skip these steps.

Time horizon mismatch. Psychological safety takes 6-12+ months to change meaningfully. Organizations and researchers want faster results, so they measure outcomes that move more quickly.

Construct confusion. Many studies measure related but distinct constructs—psychosocial safety climate (organizational-level), positive workplace climate (broader), supportive supervision (antecedent), voice behavior (consequence). These proxies aren't psychological safety as Edmondson defined it.

Healthcare measurement challenges. With 329+ healthcare-specific studies in our corpus, the sector's unique challenges compound the problem: high turnover, rotating shifts, unstable team composition. Healthcare ICCs in the research often fall between 0.04 and 0.25, sometimes dipping below even the minimal threshold that justifies aggregation.


What Proper Measurement Actually Looks Like

Measuring psychological safety correctly requires four elements most organizational surveys skip:

1. Team-Level Administration

Survey by team, not by department or organization. Define "team" consistently: people who work interdependently toward shared goals. The team is the unit of analysis—not the individual, not the department.

2. Sufficient Sample Within Teams

You need a minimum of 3 respondents per team to calculate meaningful statistics. Target an 80%+ response rate within each team. With only 1-2 responses per team, you can't distinguish signal from noise.
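
As a sketch, a pre-aggregation screen applying those two thresholds might look like this (the response data and headcounts are hypothetical):

    import pandas as pd

    # One survey row per respondent; full headcount per team from HR records
    responses = pd.DataFrame({"team": ["A", "A", "A", "A", "B", "B", "C"]})
    headcount = pd.Series({"A": 5, "B": 5, "C": 4})

    n_respondents = responses["team"].value_counts()
    response_rate = n_respondents / headcount

    # Keep only teams meeting both thresholds from the text
    eligible = (n_respondents >= 3) & (response_rate >= 0.80)
    print(eligible)  # A: True (4 of 5), B: False (2 of 5), C: False (1 of 4)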

3. ICC Calculation and Validation

Before aggregating individual responses to team-level scores, calculate ICC to verify that team members' perceptions actually converge.

<.05: the ICC threshold below which aggregation isn't justified; perceptions don't converge enough to represent a shared team property

ICC interpretation matters:

ICC below .05: insufficient agreement. Don't aggregate; analyze at the individual level only.

ICC between .05 and .15: low but acceptable. Aggregate with caution and note the limitation.

ICC between .15 and .30: moderate agreement. Solid justification for aggregation.

ICC above .30: strong agreement. An excellent team-level construct.

Edmondson achieved .39. Most organizational surveys don't calculate ICC at all.
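
For teams that automate their survey pipeline, the bands above reduce to a small helper like this sketch:

    # Map an ICC value onto the interpretation bands above
    def interpret_icc(icc: float) -> str:
        if icc < 0.05:
            return "insufficient agreement: don't aggregate"
        if icc < 0.15:
            return "low but acceptable: aggregate with caution"
        if icc < 0.30:
            return "moderate agreement: aggregation justified"
        return "strong agreement: excellent team-level construct"

    print(interpret_icc(0.39))  # Edmondson's result: strong agreement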

4. Validated Instrumentation

Use Edmondson's 7-item scale or a properly validated adaptation. The items matter. Sample items include:

"If you make a mistake on this team, it is often held against you." (reverse-scored)

"Members of this team are able to bring up problems and tough issues."

"It is safe to take a risk on this team."

Avoid creating ad-hoc items or using "psychological safety" questions from generic engagement surveys—they typically haven't been validated at the team level.
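
One practical detail worth showing: reverse-scored items must be flipped before any averaging. A minimal sketch, assuming a 7-point response scale:

    # Flip negatively worded items before averaging (assumes a 7-point scale):
    # strong agreement with "mistakes are held against you" means LOW safety.
    SCALE_MAX = 7

    def reverse_score(raw: int) -> int:
        return SCALE_MAX + 1 - raw

    print(reverse_score(7))  # 1: high agreement with the negative item = low safety
    print(reverse_score(2))  # 6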


What This Means Practically

If you're measuring psychological safety (or planning to), here's what to do differently:

Stop reporting organization-wide averages. They're meaningless for psychological safety. Report team-level scores, and identify the distribution: how many teams fall in each range?
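
For example, a team-level distribution report is a few lines of code once team scores are validated (team names, scores, and band labels below are hypothetical):

    import pandas as pd

    # Validated team-level means on a 5-point scale
    team_scores = pd.Series({
        "Alpha": 4.6, "Bravo": 4.4, "Charlie": 3.8,
        "Delta": 3.1, "Echo": 2.2, "Foxtrot": 2.0,
    })

    bands = pd.cut(team_scores, bins=[1, 2.5, 3.5, 4.5, 5],
                   labels=["at risk", "low", "moderate", "strong"])
    print(bands.value_counts().sort_index())  # how many teams fall in each range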

Calculate and report ICC. If you can't demonstrate within-team convergence, you can't claim to have measured psychological safety. You measured individual perceptions that happened to use psychological safety wording.

Use the data for team-level decisions. Which teams need intervention? What differentiates high-scoring teams from low-scoring ones? Are there patterns by manager, tenure, or other factors?

Track team-level change over time. After interventions, did specific teams improve? Did the intervention work better in some contexts than others? Organization-wide averages will hide this signal—team-level data reveals it.
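
A minimal sketch of what that per-team tracking looks like, with hypothetical scores:

    import pandas as pd

    # Hypothetical before/after team-level scores around an intervention
    scores = pd.DataFrame({
        "team":   ["Alpha", "Bravo", "Charlie", "Delta"],
        "before": [3.2, 3.4, 3.3, 3.5],
        "after":  [4.3, 3.4, 4.1, 3.4],
    }).set_index("team")

    scores["change"] = scores["after"] - scores["before"]
    print(scores.sort_values("change", ascending=False))
    # Alpha and Charlie improved sharply; the org-wide mean moved far less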

Be honest about what you're measuring. If you can't implement team-level measurement (too few people per team, can't ensure confidentiality within small teams), acknowledge that you're measuring individual perceptions of safety—which is useful data, but isn't psychological safety as the construct was defined and validated.


The Bottom Line

The 96.4% measurement failure rate isn't about incompetence. It's about convenience. Individual-level surveys are easier to administer, analyze, and explain. Team-level measurement requires more sophisticated design and analysis.

But easier isn't the same as valid.

Psychological safety predicts team performance because it's a team-level phenomenon—a shared perception that emerges from collective experience. Measuring it at the individual level and averaging up doesn't capture the construct. It captures something else wearing the same name.

Organizations making decisions based on improperly measured psychological safety data are flying blind. They're investing resources in interventions they can't properly evaluate. They're missing the team-level dynamics that actually drive performance.

The 3.6% who measure correctly have a significant advantage: they can see what's actually happening in their teams, target interventions precisely, and track whether those interventions work.

That's not a marginal improvement. It's the difference between actionable intelligence and expensive guessing.

See Where Your Organization Stands

A 5-minute assessment based on the research above.

Take the A.R.T. Assessment →