
A practical blueprint for legal and ethical AI research

ReviewerOne

20 Apr 2026 | Read Time: 5 mins

“A practical blueprint for legal and ethical AI research” by Siân Brooke and Nick Oh, originally published on LSE Blogs, licensed under CC BY 4.0.

When accessing publicly available data, researchers need to navigate data protection law, intellectual property, and platform norms and regulations. To address these challenges, Siân Brooke and Nick Oh introduce the PETLP framework and discuss how a privacy-by-design research pipeline can help research teams use social media data responsibly and transparently.

When members of Reddit’s r/schizophrenia discovered their posts had been analysed in a published paper – despite their community’s explicit rule requiring prior approval for research – the backlash was swift. Imagine sharing your mental health struggles in what feels like a supportive space, only to find your words became data without anyone asking the community first. The subsequent retraction wasn’t just about legality; it was about legitimacy, community expectations, and trust.

This incident illustrates a deeper problem. Researchers face a legal and ethical landscape with no clear path through. Data protection law, intellectual property rights, and platform contracts rarely align, leaving researchers in a “triple bind” that often pulls in contradictory directions: GDPR treats all social media content as personal data; copyright and database rights protect posts and collections; and platform terms impose their own restrictions. Beyond these legal constraints, research ethics adds another layer of complexity when working with vulnerable communities.

Why this debate keeps flaring up

When research touches vulnerable communities, legality is not the only bar. The recent retraction of a study at the University of Zurich that covertly deployed AI-generated posts into Reddit’s r/ChangeMyView discussions without the knowledge or consent of participants illustrates why ethics cannot be bolted on after data collection. In that case, community members and moderators only learned of the intervention after data had been gathered, prompting backlash, a formal warning to the researchers, and ultimately withdrawal of the article.

Projects need workflows that bake privacy and community legitimacy into every stage. Yet existing frameworks offer little practical guidance on navigating the intersection of data protection regulations, copyright law, and platform governance. For computational social scientists, AI researchers, and others using social media data in their studies, navigating this complexity shouldn’t require a law degree.

A practical way through: PETLP

PETLP (Privacy-by-design Extract, Transform, Load and Present) extends the familiar ETL (Extract, Transform, Load) model from computer science — a step-by-step process for moving data from source systems into usable formats for analysis — with two critical additions. First, privacy-by-design becomes the default via a living Data Protection Impact Assessment (DPIA). A DPIA is essentially a privacy risk assessment that identifies potential harms and mitigation strategies; calling it “living” means it gets updated throughout the project as methods evolve, rather than being a one-time checkbox exercise. Second, a final Present stage covers how you share results, datasets, and trained models. The goal is to make legal and ethical choices routine, not a scramble at the end.
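As a minimal sketch of this idea — all function and field names here are illustrative, not from any actual PETLP implementation — the four stages can be written as explicit pipeline steps, with the living DPIA as a record that each stage updates as it runs:

```python
from dataclasses import dataclass, field

@dataclass
class DPIA:
    """A living privacy risk log, appended to at every pipeline stage."""
    entries: list = field(default_factory=list)

    def record(self, stage: str, risk: str, mitigation: str) -> None:
        self.entries.append({"stage": stage, "risk": risk, "mitigation": mitigation})

def extract(dpia: DPIA) -> list[dict]:
    # Collect posts via a lawful route (e.g. an official API); stand-in data here.
    dpia.record("Extract", "collecting personal data", "API route; TDM exception documented")
    return [{"id": "abc123", "author": "user1", "text": "example post"}]

def transform(posts: list[dict], dpia: DPIA) -> list[dict]:
    # Privacy safeguards during preprocessing: drop direct identifiers.
    dpia.record("Transform", "re-identification", "pseudonymise authors before analysis")
    return [{"id": p["id"], "text": p["text"]} for p in posts]

def load(posts: list[dict], dpia: DPIA) -> list[dict]:
    # Stand-in for writing to encrypted, access-controlled storage.
    dpia.record("Load", "unauthorised access", "secure storage architecture")
    return posts

def present(posts: list[dict], dpia: DPIA) -> dict:
    # Share aggregates, never raw rows.
    dpia.record("Present", "disclosure in outputs", "aggregate findings only")
    return {"n_posts": len(posts)}

dpia = DPIA()
summary = present(load(transform(extract(dpia), dpia), dpia), dpia)
```

The point of the sketch is structural: the DPIA travels through the pipeline and accumulates an entry per stage, so the privacy record and the processing steps cannot drift apart.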

“The framework’s key innovation lies in treating compliance as a design tool rather than a constraint”

The framework’s key innovation lies in treating compliance as a design tool rather than a constraint. Each stage addresses specific legal challenges: Extract navigates platform restrictions and copyright exceptions, Transform implements privacy safeguards during preprocessing, Load establishes secure storage architectures, and Present manages the unique risks of public dissemination.

Putting PETLP into practice

Before collecting data, training models, or publishing outputs, you can use this pre-flight check. It translates PETLP into concrete choices while complementing (not replacing) your DPIA and legal advice.

Controller relationships: Under GDPR, a “controller” decides how and why personal data gets processed — essentially, who’s in charge of the data. Establish whether parties are joint controllers (multiple organisations sharing control decisions, requiring an Article 26 agreement spelling out responsibilities) or involve processors (service providers handling data on your behalf, needing an Article 28 contract).

Legal basis selection: GDPR requires a lawful reason for processing personal data. Universities typically invoke public interest (Article 6(1)(e)) – processing that serves society’s broader benefit, like academic research. Private researchers need a Legitimate Interest Assessment — a three-part test proving their research is necessary, serves legitimate purposes, and doesn’t unfairly impact individuals.

Text and Data Mining rights: These EU copyright exceptions let researchers extract patterns from copyrighted content without permission from rights holders.

  • Academia: Article 3 of the EU Digital Single Market (DSM) Copyright Directive protects qualifying research organisations, overriding platform Terms of Service (ToS)
  • Industry: Article 4 applies, meaning robots.txt files (technical instructions telling automated tools which parts of websites they can access) and platform opt-outs can legally block you
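For researchers who fall under Article 4, the robots.txt opt-out check can be automated with Python’s standard library before any extraction begins. The rules and URLs below are illustrative, not from a real platform:

```python
from urllib.robotparser import RobotFileParser

def may_scrape(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Parse a robots.txt body and report whether a given URL may be fetched.

    Under DSM Article 4, a machine-readable opt-out like this can legally
    bind commercial researchers, so the check belongs before extraction.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Illustrative robots.txt forbidding automated access to user profile pages.
rules = """
User-agent: *
Disallow: /user/
Allow: /
"""

profile_ok = may_scrape(rules, "https://example.com/user/someone")
listing_ok = may_scrape(rules, "https://example.com/r/science")
```

Here `profile_ok` comes back `False` (the profile path is opted out) while `listing_ok` is `True`; an Article 4 pipeline would simply skip any URL the parser rejects.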

Four potential extraction routes: Platform APIs offer legal clarity but impose costs and limits. User-mediated collection (data donations) is often ethically preferable, but doesn’t necessarily override ToS. Third-party aggregators like the now-defunct Pushshift occupy legal grey zones, potentially violating database rights (legal protections for organised collections of data) and ToS. Self-scraping is protected for EU research organisations under DSM Article 3, but commercial researchers can be blocked by robots.txt under Article 4.

Notification requirements: GDPR Article 14 requires telling people when you collect their data indirectly (like scraping their posts). Document if the “disproportionate effort” exemption applies (when finding and contacting everyone would be practically impossible) and publish a public notice instead.

Sharing options: Paraphrase sensitive content, aggregate findings, share IDs for controlled hydration, release synthetic examples, or provide secure analysis environments. Never dump raw data.
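One of these options, sharing IDs for controlled hydration, means releasing only post identifiers so that other researchers re-fetch (“hydrate”) the content from the platform themselves; posts deleted in the meantime simply fail to hydrate, so withdrawn content drops out of downstream copies. A minimal sketch, with illustrative field names:

```python
import csv
import io

def to_hydration_file(posts: list[dict]) -> str:
    """Write only post IDs, never text or usernames, to a shareable CSV.

    Recipients hydrate the IDs against the platform's own API, which
    respects deletions and access controls at the time of re-fetching.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["post_id"])
    for post in posts:
        writer.writerow([post["id"]])
    return buffer.getvalue()

posts = [
    {"id": "t3_abc", "author": "user1", "text": "sensitive content"},
    {"id": "t3_def", "author": "user2", "text": "more content"},
]
shared = to_hydration_file(posts)
```

The shared file contains the identifiers and nothing else — no usernames, no post text — which is exactly the property that makes ID-sharing safer than releasing raw data.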

Model safety: Run membership inference tests (checking if models reveal whether specific individuals were in training data), consider differential privacy (mathematical techniques that add noise to protect individual privacy) for smaller models, document residual risks and permitted uses.
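A common form of membership inference test is a loss-threshold attack: compare how a model scores examples it was trained on against unseen ones, and flag leakage if training examples get systematically better scores. The toy “model” below is a deliberately overfit lookup table, so the attack succeeds completely — the data, threshold, and names are all illustrative:

```python
def loss(model: dict[str, str], text: str) -> float:
    """Toy per-example loss: 0.0 if the model has memorised the text, else 1.0."""
    return 0.0 if text in model else 1.0

def membership_attack(model: dict[str, str], candidates: list[str],
                      threshold: float = 0.5) -> dict[str, bool]:
    """Guess 'member' whenever loss falls below the threshold."""
    return {text: loss(model, text) < threshold for text in candidates}

train = ["post one", "post two"]
held_out = ["post three", "post four"]
model = {t: t for t in train}  # a memorising 'model' that leaks membership

guesses = membership_attack(model, train + held_out)
leak_rate = sum(guesses[t] for t in train) / len(train)
```

A real test would use per-example cross-entropy from the actual model rather than a lookup, but the decision rule is the same: if a simple threshold separates members from non-members, the model is leaking who was in the training data.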

Five things you can do today

  1. Create a living DPIA in your project repository – not as bureaucracy but as your research design document. Update it when methods change, new risks emerge, or outputs evolve.
  2. Map your extraction route before collecting anything – Document your institutional status (qualifying researcher?), platform terms, and technical constraints on a single page. Know your DSM Article 3 eligibility upfront.
  3. Engage communities before collection – Approach moderators, explain your research, seek input on methods. What seems reasonable to IRBs may violate community norms.
  4. Build privacy into preprocessing – Removing usernames is pseudonymisation, not anonymisation. Plan differential privacy (ideally) or k-anonymity from the start, not as an afterthought.
  5. Design your dissemination strategy now – Decide whether you’ll share datasets (how?), release models (with what safeguards?), or quote posts (with whose permission?). These choices shape everything upstream.
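The fourth point can be made concrete with a small k-anonymity check: a dataset is k-anonymous when every combination of quasi-identifier values (attributes that could re-identify someone in combination, such as community and posting year) appears at least k times. The fields, records, and choice of k below are simplified illustrations:

```python
from collections import Counter

def is_k_anonymous(records: list[dict], quasi_identifiers: list[str], k: int) -> bool:
    """True if every quasi-identifier value combination occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) >= k

records = [
    {"subreddit": "r/science", "year": 2024, "text": "..."},
    {"subreddit": "r/science", "year": 2024, "text": "..."},
    {"subreddit": "r/science", "year": 2023, "text": "..."},
]

# The lone 2023 record is unique on (subreddit, year), so the set fails k=2.
ok = is_k_anonymous(records, ["subreddit", "year"], k=2)
```

Failing the check means someone who knows a user posted in that community in that year could single out their record — which is why stripping usernames alone (pseudonymisation) is not enough.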

Limitations and next steps

PETLP is a framework, not a turnkey solution. Implementation overhead remains unquantified, platform policies evolve daily, and international collaborations face conflicting regulations. We need better privacy-utility benchmarks, automated provenance tools, and jurisdiction-specific guidance.

To address these gaps, we’re developing RedditHarbor – a proof-of-concept tool that implements PETLP principles specifically for Reddit research. The tool guides researchers through each compliance decision, from extraction to load.

While RedditHarbor demonstrates PETLP’s feasibility for one platform, each social media site presents unique challenges. The broader goal is creating adaptable tools that translate these principles across platforms while respecting their distinct communities and constraints.

Controversies about studying public data will not vanish. By making privacy and community impact core to method, not an afterthought, PETLP offers a practical, auditable way forward, from planning through to publication and model release. The question isn’t whether social media research will face greater scrutiny (that’s already happening). The question is whether we’ll develop frameworks robust enough to maintain both scientific progress and public trust.

About the Author

ReviewerOne

ReviewerOne is a reviewer-centric initiative focused on strengthening peer review by supporting the people who make it work. ReviewerOne provides current and aspiring reviewers with AI-powered tools and resources to help them review more confidently, consistently, and fairly, without removing the human judgment that peer review depends on.

The ReviewerOne ecosystem brings together a reviewer-friendly peer review platform with structured guidance and AI-assisted checks; a community forum to foster networking and collaboration; a Reviewer Academy with practical learning resources on peer review, AI, ethics, and integrity; and meaningful recognition through verified credentials and professional profiles. ReviewerOne aims to reduce friction in peer review while elevating reviewer expertise, effort, and contribution.
