Position Statement

AI & Ethics at Polaris Education

This page describes how we use AI at Polaris, what happens to student data, where humans review what the system produces, and what we still haven't figured out. It assumes you're skeptical about AI in schools. We think that's a reasonable place to start.

If you have questions after reading, please get in touch through our contact page.

More humans talking to each other

From audio to the people who need to hear it

Where the model sits in the listening pipeline

Student speech

What gets recorded

  • Voice-first listening sessions
  • Group discussion among students
  • Hours of audio across many schools
  • More speech than any adult can listen to directly

What the model does

Cleans the audio, finds patterns

  • Transcribe and anonymize
  • Cluster turns into themes
  • Surface representative student quotes
  • Link every quote back to its audio

What adults receive

Read the report, listen to the audio

  • A readable summary of the recordings
  • Specific student voices, attributed by school
  • Original recordings one click away
  • Material for follow-up conversations

The model handles the volume so adults can spend their time listening and learning.

Most schools are full of conversations that should happen but don't. Teachers don't have time to sit with every student about what's working in their classes. Administrators don't usually hear, in any granular way, how the schedule change played out in seventh grade. Students don't get asked a serious open-ended question about their school and given the room to answer it. A 2025 study by Conner and colleagues found that opportunities for students to weigh in on the decisions that shape their day-to-day school experience are rarely measured with validated instruments Conner et al. (2025). We see AI here as a way to make more of those conversations possible, not as a replacement for them.

The product is voice-first because the act of a student speaking to someone (a teacher, a peer, the room) is the thing we want more of. Listening sessions are designed for group discussion, where students answer the prompt with each other rather than into a form. The recording is the artifact; the conversation is the point.

The model sits between the audio and the people who need to hear it. A district running fifty listening sessions across three schools generates more student speech than any superintendent will ever sit with directly. Polaris summarizes, surfaces patterns, and gets specific student quotes in front of the adults who need to hear them, with the original audio one click away. Every quote in a published report links back to the actual recording. The report is a map; the recordings are the territory.

We're skeptical of full automation, and our pipeline reflects it

Some ed-tech vendors pitch AI as a closed loop: the system reads student input, decides what it means, and surfaces conclusions to the school without anyone in between. We think that's the wrong design. A substantial body of scholarship on automated decision-making in schools argues that when automation goes unchecked, it tends to entrench existing inequities rather than expose them Selwyn (2019); Williamson et al. (2020); Selbst et al. (2019). The risk isn't that the model is occasionally wrong; the risk is that wrongness becomes invisible at scale. The EU AI Act, which entered into force in August 2024 and becomes generally applicable in August 2026, classifies several education uses of AI as high-risk and mandates human oversight for them European Parliament and Council (2024). Polaris was built around that principle before the regulation existed.

Polaris is built around the assumption that a model's output is a draft, not a verdict. The current human review checkpoints in our pipeline:

  • Question authorship. Teachers and administrators choose the questions students are asked. The platform offers research-backed templates, but the prompt that shapes a session is a human decision.
  • PII review. Automated PII redaction runs first, and an internal reviewer checks edge cases (title-prefixed names, identifying detail combinations) before any report ships.
  • Theme consolidation. Themes for a published district report are generated with model assistance, then verified turn-by-turn against the underlying transcripts and consolidated by a human reviewer.
  • Quote-to-audio verification. A sample of displayed student quotes is mechanically checked against the original transcript before publication. When a sampled quote diverges from what the student actually said, the report gets flagged for review. The check is a real script in our codebase, not just a process commitment.
  • Safety-flag review. Content the model flags as a possible safety concern is surfaced to a teacher for review, and is governed by the same mandatory-reporting obligations a teacher already has. The system doesn't autonomously notify outside parties.
  • Final report sign-off. Hand-crafted district reports go through internal review before they're visible to the school. Automation produces the draft. Humans publish.
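The quote-to-audio verification checkpoint above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script in the Polaris codebase: the `Quote` and `Turn` records, the `flag_report` helper, and the normalization rule (lowercase, punctuation ignored) are all hypothetical names and choices made for this example.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class Quote:
    interview_id: str
    start_s: float  # start of the cited timestamp range, in seconds
    end_s: float    # end of the cited timestamp range, in seconds
    text: str       # the quote as displayed in the report


@dataclass
class Turn:
    interview_id: str
    start_s: float
    end_s: float
    text: str       # transcript text for this turn


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so formatting differences don't count as divergence."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def quote_is_verbatim(quote: Quote, turns: list[Turn]) -> bool:
    """True if the quote appears word-for-word in the turns overlapping its timestamp range."""
    overlapping = [
        t for t in turns
        if t.interview_id == quote.interview_id
        and t.start_s < quote.end_s
        and t.end_s > quote.start_s
    ]
    overlapping.sort(key=lambda t: t.start_s)
    transcript = normalize(" ".join(t.text for t in overlapping))
    needle = normalize(quote.text)
    # Pad with spaces so only whole-word matches count.
    return bool(needle) and f" {needle} " in f" {transcript} "


def flag_report(quotes: list[Quote], turns: list[Turn], sample_size: int, seed: int = 0) -> list[Quote]:
    """Check a random sample of quotes and return the ones that diverge from the transcript."""
    rng = random.Random(seed)
    sample = rng.sample(quotes, min(sample_size, len(quotes)))
    return [q for q in sample if not quote_is_verbatim(q, turns)]
```

A non-empty result from `flag_report` would correspond to a report being held for human review. The check is deliberately strict: a paraphrase that preserves meaning still counts as a divergence.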

Research on human-in-the-loop machine learning makes the same point: human review is only valuable when the humans involved have real authority and real time to use it Mosqueira-Rey et al. (2023). Review that exists on a checklist but gets rushed past in production is theater. We try, imperfectly, to make the review steps load-bearing.

Where student data goes, and where it does not

Two columns of commitments, with the contract behind each

Stays inside Polaris

Encrypted, anonymized, school-controlled

  • Encrypted in transit and at rest
  • PII stripped before storage and review
  • Voice anonymization available on request
  • Visible to school staff with permission
  • Deleted on district request

Never happens

Prohibited by charter or vendor contract

  • Sold to third parties
  • Used to train AI models
  • Shared with marketers or advertisers
  • Transferred on dissolution, merger, or acquisition
  • Released without legal process

Each item on the right is locked into either our nonprofit charter or our commercial contracts with model vendors.

Polaris Education is operated by Human Restoration Project, a nonprofit organization. Our charter prohibits the sale or transfer of student data to third parties, and that prohibition survives dissolution, merger, or acquisition. There is no commercial configuration in which student records leave the platform for another buyer. The full text is in our privacy policy.

Student responses are not used to train AI models. The model providers we work with, including Anthropic for language model inference and ElevenLabs for voice processing, offer commercial agreements that exclude customer inputs and outputs from model training. This is the standard commercial term, not a custom carve-out, and it applies to every interaction Polaris makes with those vendors on behalf of a school. We retain the right to audit those agreements and renegotiate if a vendor changes terms in a way that puts student data at risk.

Our pipeline strips personally identifiable information out of every interview before the data lands in long-term storage or reaches a teacher's screen. That includes student names, adult names, addresses, and identifying detail combinations. Voice anonymization is available on request, so even the acoustic record can be processed without leaving the original speaker recognizable. PII detection isn't perfect; we publish the cases we've caught and the categories where slips still happen so districts can see what they're working with.
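To make the redaction step concrete, here is a minimal rule-based sketch in Python. It is an assumption-laden illustration, not the Polaris pipeline: the patterns, the placeholder tokens, and the idea of a school-supplied `roster` of names are hypothetical. Rules this simple miss many cases, which is one reason a human reviewer checks edge cases.

```python
import re

# Hypothetical patterns for illustration only; a production system would need far more coverage.
ADDRESS = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+)+(?:Street|St|Avenue|Ave|Road|Rd|Lane|Ln)\b"
)
TITLE_NAME = re.compile(
    r"\b(?:Mr|Mrs|Ms|Mx|Dr|Coach|Principal)\.?\s+[A-Z][a-z]+(?:[-'][A-Z][a-z]+)?"
)


def redact(text: str, roster: set[str]) -> tuple[str, int]:
    """Replace detected PII with placeholders; return the redacted text and the number of hits."""
    hits = 0
    for pattern, placeholder in ((ADDRESS, "[ADDRESS]"), (TITLE_NAME, "[NAME]")):
        text, n = pattern.subn(placeholder, text)
        hits += n
    # Roster names (assumed to come from a class list) catch names that carry no title.
    # Longest names first, so a full name is replaced before any shorter name inside it.
    for name in sorted(roster, key=len, reverse=True):
        text, n = re.subn(rf"\b{re.escape(name)}\b", "[NAME]", text)
        hits += n
    return text, hits
```

For example, `redact("Mr. Alvarez said Jordan could retake it.", {"Jordan"})` replaces both the title-prefixed name and the roster name with `[NAME]`. Identifying detail combinations ("the only sophomore on varsity") cannot be caught by patterns like these at all.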

Polaris is FERPA, COPPA, and CIPA compliant, and was rated 95% by Common Sense Privacy in their independent evaluation. Customizable Data Processing Agreements are available for any district that needs one. Where the EU AI Act applies, it classifies several education uses of AI as high-risk, with provider obligations on data governance, transparency, and human oversight, and it specifically bans emotion-inference systems in education, a prohibition that took effect in February 2025 European Parliament and Council (2024); Maslej et al. (2025).

Computational text analysis is older than ChatGPT

Computational analysis of qualitative material has been around for decades, well before the current generation of large language models. Conversations about AI in education sometimes collapse the whole history into a single event, the public release of ChatGPT in late 2022, and miss what is actually happening underneath. Even as the field has kept changing in important ways since 2022, the methodological core of NLP still rests on statistical foundations laid decades earlier — a continuity the EU AI Act itself recognizes by defining "AI system" broadly enough to cover both older statistical techniques and current language models European Parliament and Council (2024).

The methodological foundation is older still. Glaser and Strauss's 1967 work on grounded theory established the tradition of systematically coding open-ended responses to surface emergent patterns Glaser & Strauss (1967). The first generation of computer-assisted qualitative data analysis software (CAQDAS) emerged in the late 1980s and the 1990s, well before modern machine learning was practical at scale. Tools like NUD*IST, ATLAS.ti, and NVivo come from that era.

On the statistical side, latent semantic analysis (LSA) was published in 1990 and remains a working technique for modeling word-document associations Deerwester et al. (1990). Latent Dirichlet Allocation (LDA), the topic-modeling workhorse of the 2000s and 2010s, was published in 2003 Blei et al. (2003). Word embeddings, which let a vector encode semantic similarity, were popularized by word2vec in 2013 Mikolov et al. (2013). Structural topic models extended LDA with covariates, so researchers could ask how a theme's prevalence varies across populations Roberts et al. (2019). None of this required a language model in the contemporary sense.

Polaris uses both lineages. Vector embeddings, density-based clustering (HDBSCAN), and direct frequency counting do most of the structural work that turns thousands of student turns into a tractable set of themes. Large language models are useful at the labeling and synthesis stages (naming a cluster, drafting a writeup, suggesting where two themes might be merged), but they sit on top of older statistical machinery, not in place of it. When we describe Polaris as "AI-powered," we mean a stack that includes 1990s information-retrieval techniques alongside 2024-era language models, with each used where it's honest to use it.
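The structural step described above (vectors in, themes out, sizes by counting) can be illustrated with a small, dependency-free Python sketch. Two substitutions are made for the sake of a runnable example and should be read as assumptions: plain DBSCAN stands in for HDBSCAN (HDBSCAN builds a hierarchy and does not need a fixed `eps` radius), and hand-written 2-D vectors stand in for real sentence embeddings. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def density_cluster(vectors, eps=0.2, min_pts=2):
    """Plain DBSCAN. Returns one label per vector; -1 means noise (no theme)."""
    n = len(vectors)
    # Each point's neighborhood includes the point itself.
    neighbors = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may be claimed as a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels


def theme_sizes(labels):
    """Direct frequency count of turns per cluster, ignoring noise."""
    return dict(Counter(label for label in labels if label != -1))
```

Nothing in this step involves a language model. In a stack like the one described, a model would only be asked afterwards to propose a name for each cluster, and turns labeled `-1` stay unassigned instead of being forced into a theme.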

LLMs are imperfect, and we treat them that way

Skepticism about language model output is the right default. Hallucination, where a model confidently generates content the input doesn't support, is well-documented at this point Ji et al. (2023); Kalai et al. (2025). Recent work from OpenAI traces the persistence of hallucination to the way language models are trained and evaluated: standard benchmarks reward a confident wrong answer over an honest "I don't know" Kalai et al. (2025). Bender and colleagues' 2021 "stochastic parrots" paper raised structural concerns about how training data gets curated and what fluent-but-wrong output costs the people who read it, and those concerns still hold Bender et al. (2021). Buolamwini and Gebru's study of commercial face-classification systems demonstrated the depth of demographic bias in deployed AI systems Buolamwini & Gebru (2018). The Model Cards proposal argued that any deployed AI system should ship with public documentation of its evaluation, intended use, and known failure modes Mitchell et al. (2019). We agree with the substance of these critiques. A 2025 head-to-head study of LLM-led versus human-led thematic analysis found that the LLMs hallucinated in ways that ranged from single-phrase changes to truncations and recombinations that altered what students actually said. The authors concluded that LLMs aren't yet a substitute for experienced qualitative researchers Mehta et al. (2025). This matches our experience.

Skepticism doesn't require abstinence. Language models are useful for specific, narrow jobs (transcript cleaning, theme labeling, drafting prose) when their output is treated as a draft and verified against ground truth before it leaves the system. The concrete safeguards we use:

  • Quotes are bound to audio. Every student quote in a published Polaris report has an interview ID and a timestamp range. A reader can play the actual audio. If a quote doesn't play, the citation is broken and we treat it as a defect.
  • Verbatim verification. A sample of displayed quotes is checked against the transcript at the cited timestamp. Divergences flag the report for review.
  • Counts, not estimates. When a report says "mentioned in N conversations," that number gets computed by classifying every student turn and counting distinct conversations, not by asking a model to estimate.
  • Cited research is checked. Models will sometimes generate plausible-looking citations that don't exist. Research references in the deeper-research insight pages are cross-checked against real DOIs and journals before publication.
  • Diversity of sources. A theme isn't allowed to rest on a single student's speech without that being made explicit. Reports surface multiple students per theme, so one over-represented voice can't stand in for a class.
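The "counts, not estimates" and "diversity of sources" safeguards above both reduce to set arithmetic over classified turns. A minimal Python sketch, assuming each student turn has already been assigned a theme; the `ClassifiedTurn` record, its field names, and the `single_voice` flag are hypothetical and chosen for this example:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifiedTurn:
    conversation_id: str
    speaker_id: str  # anonymized speaker label, unique within a conversation
    theme: str       # theme assigned to this turn by the classification step


def theme_counts(turns):
    """For each theme, count distinct conversations and distinct speakers."""
    conversations = defaultdict(set)
    speakers = defaultdict(set)
    for t in turns:
        conversations[t.theme].add(t.conversation_id)
        # Speaker labels are only unique per conversation, so key on the pair.
        speakers[t.theme].add((t.conversation_id, t.speaker_id))
    return {
        theme: {
            "conversations": len(conversations[theme]),  # the N in "mentioned in N conversations"
            "speakers": len(speakers[theme]),
            "single_voice": len(speakers[theme]) == 1,   # theme rests on one student's speech
        }
        for theme in conversations
    }
```

Because the numbers come from counting sets, a student who raises the same point five times in one session still contributes one conversation and one speaker. A theme with `single_voice` set would need that fact stated explicitly in the report.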

Any single model's output is a guess with a confidence number attached. The job of a serious research tool is to design the system around that guess so it stays checkable, replaceable, and recoverable when wrong, instead of wrapping it in interface polish and calling it an answer.

Open questions we're still working through

This list isn't exhaustive. Each item is a known limitation of the current product.

  • Diarization fails in noisy classrooms. Splitting one continuous audio stream into who-said-what is hard when multiple students share a microphone. Audio clips for inline quotes are currently turn-level, not phrase-level, which means a clicked phrase may play surrounding context including a facilitator.
  • Multilingual quality varies by language. English transcription is the most accurate, Spanish is strong, and less-resourced languages are weaker. Theme analysis depth depends on transcript quality, so analysis degrades for sessions in less-resourced languages.
  • Long-form summarization can drift. When a model is asked to summarize across many turns, it can introduce phrasing no student actually used. A 2025 study comparing LLM and human thematic analyses found exactly this pattern: hallucinations ranging from single phrase changes to recombinations that modify the meaning of the source Mehta et al. (2025).
  • Consent operates inside power asymmetry. Even with an opt-in consent flow, a student in a school setting may not feel free to refuse a tool the teacher has set up for the class. This isn't specific to Polaris. It's a long-standing concern in research on big-data practices in education boyd & Crawford (2012).
  • Patterns vs. surveillance. A report describing patterns at the school level can quietly become a report describing individuals at the classroom level. The line between useful pattern and surveillance is sometimes subtle, especially when the data points happen to cluster around a single staff member or a single student.

We prioritize renewable, low-energy solutions

We try to run our infrastructure on renewable energy wherever the underlying provider supports it.

Polaris runs on DigitalOcean's NYC3 servers, which sit inside a New York-area data center run by Equinix. In its 2024 sustainability report, Equinix says that 96% of the electricity powering its data centers globally comes from renewable sources, the eighth year in a row above 90%. DigitalOcean itself doesn't publish a breakdown by region, so this is the most specific thing we can honestly say about the energy mix behind the building our servers live in.

GPU work (transcription, embeddings, and the heavier compute behind theme analysis) runs serverlessly on Modal. Modal doesn't publish a sustainability page of its own, but its primary infrastructure partner is Oracle Cloud, and Oracle reports matching 100% of its global electricity use with renewable energy in its 2025 fiscal year, including in the US East region in Ashburn, Virginia, where most of these jobs actually run. "Matched with renewable energy" isn't the same as 24/7 carbon-free power: Oracle is buying renewable credits and PPAs that equal its total consumption. Still, it's a real and audited claim. On top of that, because Modal scales containers all the way down to zero between jobs, we only draw power during the seconds a transcription or embedding is actually running, not around the clock like a dedicated GPU would. For inference against third-party language models, we batch requests and cache results so the same content isn't processed twice.
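The batching-and-caching idea at the end of the paragraph above can be sketched as a content-addressed cache in Python. This is an illustration under assumptions, not Polaris's implementation: `InferenceCache` and the `call_model` callable (which takes a list of texts and returns a list of outputs in the same order) are hypothetical, and the cache here lives in memory where a real one would persist to storage.

```python
import hashlib


class InferenceCache:
    """Deduplicate inputs by content hash and send only unseen ones to the model, in batches."""

    def __init__(self, call_model, batch_size=8):
        self._call_model = call_model  # hypothetical: list of texts in, list of outputs out
        self._batch_size = batch_size
        self._results = {}             # content hash -> cached output

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def run(self, texts):
        # Collect inputs that are neither cached nor already queued, in first-seen order.
        pending = {}
        for text in texts:
            key = self._key(text)
            if key not in self._results and key not in pending:
                pending[key] = text
        keys = list(pending)
        # One model call per batch instead of one per input.
        for start in range(0, len(keys), self._batch_size):
            batch_keys = keys[start:start + self._batch_size]
            outputs = self._call_model([pending[k] for k in batch_keys])
            self._results.update(zip(batch_keys, outputs))
        # Answer every input, repeats included, from the cache.
        return [self._results[self._key(text)] for text in texts]
```

Keying on a hash of the content, not on a session or file name, means a re-run over the same transcripts issues no new model calls.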

We don't pretend any of this is carbon-free. Training and running large neural networks has measurable energy and water costs, and there's a substantial body of academic work on it Strubell et al. (2019); Patterson et al. (2021). The IEA's April 2025 Energy and AI report projects that global data center electricity consumption will roughly double to about 945 TWh by 2030, and identifies AI inference at scale as a meaningful component of that growth International Energy Agency (2025). AI inference draws power, the grid is partly fossil-fueled, and the most we can do at our scale is choose smaller models when they're sufficient, prefer renewable-procuring providers, and avoid recomputing what we've already computed.

We can do edtech differently

Skepticism about AI in schools is the right place to start. We also think these tools can do useful work when they're pointed at something specific: getting more people into honest conversation with each other, not fewer.

Our organization leans into a solarpunk philosophy: technology can serve people and the planet, and we can use fascinating technologies to better the world. Polaris is one small piece of that, a way to put student voice into the decisions schools actually make.

For the bigger picture, the Human Restoration Project publishes essays on systemic change in education and keeps a library of past conference sessions on the same questions.

References

Peer-reviewed and primary sources cited above. Where possible, links go to the publisher of record.

  1. Glaser, B. G., & Strauss, A. L. (1967). The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine Publishing.
  2. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. doi:10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
  3. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
  4. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26.
  5. Roberts, M. E., Stewart, B. M., & Tingley, D. (2019). stm: An R package for structural topic models. Journal of Statistical Software, 91(2). doi:10.18637/jss.v091.i02
  6. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. doi:10.1145/3442188.3445922
  7. Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38. doi:10.1145/3571730
  8. Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of the 1st Conference on Fairness, Accountability and Transparency, 81, 77–91.
  9. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., & Gebru, T. (2019). Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229. doi:10.1145/3287560.3287596
  10. Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645–3650. doi:10.18653/v1/P19-1355
  11. Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L. M., Rothchild, D., So, D., Texier, M., & Dean, J. (2021). Carbon emissions and large neural network training. arXiv:2104.10350.
  12. Selwyn, N. (2019). Should Robots Replace Teachers? AI and the Future of Education. Polity Press.
  13. Williamson, B., Eynon, R., & Potter, J. (2020). Pandemic politics, pedagogies and practices: Digital technologies and distance education during the coronavirus emergency. Learning, Media and Technology, 45(2), 107–114. doi:10.1080/17439884.2020.1761641
  14. boyd, d., & Crawford, K. (2012). Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society, 15(5), 662–679. doi:10.1080/1369118X.2012.678878
  15. Mosqueira-Rey, E., Hernández-Pereira, E., Alonso-Ríos, D., Bobes-Bascarán, J., & Fernández-Leal, Á. (2023). Human-in-the-loop machine learning: A state of the art. Artificial Intelligence Review, 56, 3005–3054. doi:10.1007/s10462-022-10246-w
  16. Selbst, A. D., boyd, d., Friedler, S. A., Venkatasubramanian, S., & Vertesi, J. (2019). Fairness and abstraction in sociotechnical systems. Proceedings of the Conference on Fairness, Accountability, and Transparency, 59–68. doi:10.1145/3287560.3287598
  17. Conner, J., Holquist, S. E., Mitra, D. L., & Boat, A. (2025). Measuring student voice practices: Development and validation of school and classroom scales. AERA Open, 11(1). doi:10.1177/23328584251326893
  18. Mehta, S. D., Paul, S., Awiti, E., et al. (2025). Evaluation of large language models within GenAI in qualitative research. Scientific Reports, 15, Article 34993. doi:10.1038/s41598-025-18969-w
  19. European Parliament and Council (2024). Regulation (EU) 2024/1689 of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union. Several education uses of AI are classified as high-risk under Annex III; emotion-inference systems in education prohibited from February 2025; general application from August 2026.
  20. Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why language models hallucinate. arXiv:2509.04664.
  21. International Energy Agency (2025). Energy and AI. IEA, Paris.
  22. Maslej, N., Fattorini, L., Perrault, R., Gil, Y., Ligett, K., Lyons, T., Manyika, J., Niebles, J. C., Shoham, Y., Wald, R., et al. (2025). Artificial Intelligence Index Report 2025. Stanford Institute for Human-Centered Artificial Intelligence.