The Online Safety Act 2023 represents the most significant shift in UK platform regulation in a generation. For any business operating a UK-accessible online service that hosts user-generated content — whether a social platform, forum, marketplace, dating service, or gaming product — it imposes legal duties to identify, assess, and mitigate risks from illegal and harmful content.
At any meaningful scale, those duties cannot be discharged by human moderators alone. AI classification is not merely an operational convenience; for regulated platforms, it is a practical necessity. This article covers how to build AI moderation systems that are effective, transparent, and consistent with Ofcom's published guidance.
What the Online Safety Act Requires
The OSA creates a tiered framework. All in-scope services must conduct an Illegal Content Risk Assessment (and, where the service is likely to be accessed by children, a Children's Risk Assessment), implement proportionate safety measures, and maintain records demonstrating compliance. Category 1 services (the largest platforms) face additional transparency and accountability requirements.
Ofcom's Codes of Practice, published in final form in late 2024 with the first duties taking effect in early 2025, set out the specific measures Ofcom considers appropriate. They include proactive technology for detecting certain categories of illegal content — particularly child sexual abuse material (CSAM) — and systems for managing user reports effectively. The key principle is that platforms must be able to demonstrate that their moderation systems are proportionate to the risks they have identified.
Critically, Ofcom has made clear that automated systems must be complemented by human review processes. Purely algorithmic enforcement without human oversight is not considered adequate, particularly for high-stakes decisions such as account suspensions or the removal of content that may have legitimate news or educational value.
AI Classification: What It Can and Cannot Do
Modern AI content moderation operates across several modalities:
- Text classification: Detecting hate speech, harassment, spam, self-harm content, and extremist material in written posts, comments, messages, and profiles
- Image and video analysis: Identifying nudity, violence, CSAM indicators, and other visual policy violations
- Audio analysis: Transcription and classification of voice content
- Behavioural signals: Unusual account activity, coordinated inauthentic behaviour, and spam patterns that indicate policy violations even without examining content directly
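These modality-specific classifiers typically feed a single moderation pipeline, which is simpler if their outputs share a common shape. A minimal sketch of such a result type (all names and the log schema are hypothetical, not any vendor's API):

```python
from dataclasses import dataclass
from enum import Enum

class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    BEHAVIOUR = "behaviour"

@dataclass(frozen=True)
class Classification:
    """One classifier verdict for one piece of content."""
    content_id: str
    modality: Modality
    category: str        # e.g. "harassment", "nudity", "spam"
    score: float         # model confidence in [0, 1]
    model_version: str   # recorded for audit trails and appeals

# Example: a text classifier result entering the pipeline
result = Classification("post-123", Modality.TEXT, "harassment", 0.91, "txt-clf-v6")
assert 0.0 <= result.score <= 1.0
```

Recording the model version alongside each score matters later: appeals and compliance records need to show which model made a given decision.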
AI classification excels at scale and consistency. A well-tuned model will apply the same threshold to the millionth piece of content as it did to the first, without fatigue, without bias drift, and at a cost per item that is orders of magnitude lower than human review. It can operate in real time, blocking content before it is published rather than after.
What AI does less well is contextual nuance. Satire, journalism, academic discussion, and counter-speech can all be misclassified by models trained to detect harmful content. A post quoting hateful rhetoric in order to critique it may be flagged by the same model that would flag the original statement. This is where human-in-the-loop processes are essential.
Precision vs Recall: The Core Trade-off
Every content moderation system must grapple with the fundamental tension between precision (only flagging content that is genuinely harmful) and recall (catching all genuinely harmful content). Adjusting the classifier threshold in either direction changes this balance.
A low threshold catches more harmful content but generates more false positives — legitimate content incorrectly flagged. A high threshold reduces false positives but allows more harmful content through. The right calibration depends on the nature of the content category and the consequences of each type of error.
For CSAM, the threshold should be set to maximise recall — the cost of a false negative is severe, and false positives can be reviewed by humans before any punitive action is taken. For spam, a higher-precision threshold may be appropriate to avoid disrupting legitimate users.
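The trade-off becomes concrete when you sweep a threshold over a labelled validation set. A sketch with toy scores (the data and numbers are illustrative only):

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall for a given flagging threshold.

    scores: classifier outputs in [0, 1]
    labels: True where the content is genuinely harmful
    """
    flagged = [label for s, label in zip(scores, labels) if s >= threshold]
    tp = sum(flagged)                # true positives: flagged and harmful
    fp = len(flagged) - tp           # false positives: flagged but legitimate
    fn = sum(labels) - tp            # false negatives: harmful but missed
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Toy validation set: lowering the threshold raises recall, costs precision
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
labels = [True, True, False, True, False, False]
for t in (0.85, 0.50):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Running this sweep per content category, and recording the chosen operating point, is exactly the kind of documented calibration decision discussed below.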
Ofcom's guidance does not prescribe specific accuracy thresholds but does expect platforms to document their calibration decisions and the rationale behind them as part of their risk assessment and compliance record.
Building a Human-in-the-Loop Architecture
An effective AI moderation architecture typically operates in three tiers:
Tier 1: Automatic action
Content that scores above a high-confidence threshold for clear policy violations (CSAM, extreme violence, known spam patterns) is removed automatically and immediately. There is no human review before removal, but every decision is logged and remains reviewable after the fact.
Tier 2: Human review queue
Content that scores above a lower threshold but below the automatic action threshold is held pending human review. Moderators work a prioritised queue, with highest-severity classifications surfaced first. The AI provides the classification rationale and confidence score alongside the content to support the moderator's decision.
Tier 3: Reactive only
Content that scores below the review threshold is published. User reports and behavioural signals continue to surface this content for review if it generates complaints or exhibits suspicious engagement patterns.
This architecture ensures that human judgement is applied to the cases where it adds most value — the ambiguous middle ground — rather than to every piece of content or only to the most extreme cases.
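The three tiers reduce to a routing function over per-category threshold pairs. A sketch — the threshold values here are placeholders for illustration, not recommendations; real values come from the calibration process described earlier:

```python
from enum import Enum

class Action(Enum):
    REMOVE = "remove"          # Tier 1: automatic action, logged
    HUMAN_REVIEW = "review"    # Tier 2: queued for a moderator
    PUBLISH = "publish"        # Tier 3: reactive signals only

# Hypothetical per-category thresholds (auto-action and review floors)
THRESHOLDS = {
    "csam":       {"auto": 0.70, "review": 0.30},  # recall-weighted
    "spam":       {"auto": 0.98, "review": 0.90},  # precision-weighted
    "harassment": {"auto": 0.95, "review": 0.60},
}

def route(category: str, score: float) -> Action:
    """Map a classifier score onto the three-tier architecture."""
    t = THRESHOLDS[category]
    if score >= t["auto"]:
        return Action.REMOVE
    if score >= t["review"]:
        return Action.HUMAN_REVIEW
    return Action.PUBLISH

# The same score leads to different actions depending on category risk
assert route("csam", 0.50) is Action.HUMAN_REVIEW
assert route("spam", 0.50) is Action.PUBLISH
```

Keeping thresholds in configuration rather than code also makes the calibration record Ofcom expects easier to maintain: the operating points are explicit, versionable data.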
Appeal Workflow Automation
The OSA requires in-scope services to provide users with a means to appeal content moderation decisions. Managing appeals manually at scale is not feasible for most platforms. Automation can handle:
- Immediate acknowledgement of appeal submissions
- Routing to the appropriate review tier based on the original decision type
- Re-classifying the content with additional context provided by the user
- Automated responses for appeals against decisions made with very high confidence
- Escalation to senior human reviewers for appeals involving complex context
Appeals data is also a valuable feedback loop for improving your classifiers. Successful appeals — where human reviewers overturn automated decisions — are a training signal indicating where your model's calibration needs adjustment.
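The appeal workflow above can be sketched as a triage function plus a feedback metric. Everything here is hypothetical: `reclassify` stands in for whatever re-scoring call your platform exposes, and the auto-uphold and escalation thresholds are placeholders:

```python
from dataclasses import dataclass

AUTO_UPHOLD = 0.99   # hypothetical: appeals against near-certain decisions

@dataclass
class Appeal:
    content_id: str
    category: str
    original_score: float
    user_context: str    # extra context supplied by the appellant

def triage(appeal: Appeal, reclassify) -> str:
    """Route an appeal to automated response, human review, or escalation.

    `reclassify` is a hypothetical callable that re-scores the content
    with the user's added context taken into account.
    """
    if appeal.original_score >= AUTO_UPHOLD:
        return "auto_uphold"       # automated response, still logged
    new_score = reclassify(appeal.content_id, appeal.user_context)
    if new_score < appeal.original_score * 0.5:
        return "senior_review"     # added context changed the picture markedly
    return "human_review"

def overturn_rate(outcomes) -> float:
    """Feedback loop: share of automated decisions reversed on appeal."""
    outcomes = list(outcomes)
    overturned = sum(1 for o in outcomes if o == "overturned")
    return overturned / len(outcomes) if outcomes else 0.0
```

Tracking `overturn_rate` per category over time is a simple, auditable way to spot where calibration is drifting.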
Transparency Reporting Automation
Categorised services (Category 1, 2A, and 2B) are required to publish transparency reports disclosing their moderation activity. Even for smaller platforms, maintaining internal reporting is essential for demonstrating compliance to Ofcom. Automated reporting pipelines can generate these reports from your moderation data: volumes of content reviewed, action rates by category, appeal outcomes, and false positive rates. Built correctly, these run on a schedule and require no manual compilation.
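A minimal aggregation over decision logs might look like the following sketch. The log schema (`category`, `action`, `appealed`, `overturned`) is a hypothetical example, not a prescribed format:

```python
from collections import Counter

def build_report(decisions):
    """Aggregate moderation decision logs into transparency-report figures.

    `decisions` is an iterable of dicts, e.g.
    {"category": "spam", "action": "remove", "appealed": True, "overturned": False}
    """
    by_cat = {}
    for d in decisions:
        c = by_cat.setdefault(d["category"], Counter())
        c["reviewed"] += 1
        if d["action"] == "remove":
            c["removed"] += 1
        if d.get("appealed"):
            c["appeals"] += 1
            if d.get("overturned"):
                c["overturned"] += 1
    return {
        cat: {
            "reviewed": c["reviewed"],
            "action_rate": c["removed"] / c["reviewed"],
            "appeal_overturn_rate": (c["overturned"] / c["appeals"]) if c["appeals"] else 0.0,
        }
        for cat, c in by_cat.items()
    }
```

Scheduled against your decision store, a function like this produces the per-category volumes and rates that both external transparency reports and internal compliance records draw on.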
Available Tools and Integration Options
The main commercial options for UK platforms are:
- AWS Rekognition: Strong image and video moderation, well-suited to media-heavy platforms (note that CSAM detection is typically handled separately via hash-matching tools such as Microsoft's PhotoDNA or IWF hash lists, rather than by general-purpose classifiers)
- Azure AI Content Safety (successor to the deprecated Azure Content Moderator): Text and image classification with granular category and severity controls, good UK data residency options
- Google Perspective API (Jigsaw): Strong for toxicity and harassment in text; Google Cloud Vision SafeSearch covers image moderation
- Custom models: Fine-tuned on your platform's specific community norms and content types, required when off-the-shelf classifiers are insufficiently accurate for your use case
Most platforms use a combination: commercial APIs for well-defined categories (CSAM, explicit content) and custom models for nuanced community-specific classification tasks.
Data-Related Complaints and the ICO
Where content moderation involves processing personal data — which it almost always does — the ICO's expectations apply alongside Ofcom's. Automated moderation decisions that significantly affect users may trigger rights around automated decision-making under Article 22 of the UK GDPR. Your moderation system should be designed so that consequential decisions (account suspension, content removal) involve human review, and users are informed of this. Document your lawful basis for processing and your DPIA for each moderation system.
Where to Start
If you are building your first AI moderation system, start with your highest-volume, highest-risk content category. For most platforms, that means text moderation for hate speech and harassment — the category most likely to generate user complaints and Ofcom scrutiny, and the one where commercially available classifiers are most mature. Get that tier running with a proper human review queue before expanding to other modalities. Compliance is about demonstrable, proportionate systems — not perfection.