System Prompt Linter

    Grade and improve your LLM system prompt — clarity, structure, token bloat, common anti-patterns

    Grade: F (1 error, 4 warnings, 1 info, 2 passing checks)

    Length (pass): Compact prompt (~52 tokens).
    Role (pass): Role is defined.
    Vague language, "helpful, harmless, and honest": Marketing boilerplate that models already absorb from training; spend tokens on domain-specific constraints instead.
    Vague language, "be nice": Vague. Specify tone (friendly/formal/terse), audience (developers/customers), and response style (bullets/prose).
    Vague language, "remember": Unreliable for in-context retrieval. State the rule explicitly each place it applies, or put it behind an XML tag the model can re-read.
    Vague language, "ai assistant": Generic. Specifying the domain ('You are a TypeScript refactoring assistant.') increases rule adherence.
    Jailbreak smell: The prompt invites the model to disregard its safety training. This rarely works and triggers RLHF refusals; replace it with a specific allowlist of what IS permitted.
    Style: Missing apostrophes in contractions ('dont', 'cant'). Models mimic your writing style; sloppy input yields sloppy output.

    About the System Prompt Linter

    A system prompt is the most underrated piece of code in most AI products. Subtle failures there cascade into expensive evaluation loops downstream. This linter runs 12+ structural and semantic checks in the browser and surfaces issues before you ship — covering length, structure, vagueness, contradictions, and output-format specification.
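    A vagueness check like the ones in the sample report can be sketched as a phrase scan. The patterns, severities, and suggestions below are assumptions for illustration; they are not the linter's actual rules:

    ```python
    import re

    # Hypothetical vague-phrase rules: pattern -> suggestion.
    VAGUE_PHRASES = {
        r"\bhelpful,?\s+harmless,?\s+and\s+honest\b":
            "Marketing boilerplate; spend tokens on domain-specific constraints.",
        r"\bbe\s+nice\b":
            "Vague. Specify tone, audience, and response style.",
        r"\bai\s+assistant\b":
            "Generic. Name the domain, e.g. 'TypeScript refactoring assistant'.",
    }

    def lint_vague(prompt: str) -> list[dict]:
        """Return one finding per vague phrase the prompt contains."""
        findings = []
        for pattern, suggestion in VAGUE_PHRASES.items():
            for match in re.finditer(pattern, prompt, re.IGNORECASE):
                findings.append({
                    "rule": "vague-language",
                    "severity": "warning",
                    "match": match.group(0),
                    "suggestion": suggestion,
                })
        return findings

    findings = lint_vague("You are a helpful, harmless, and honest AI assistant. Be nice.")
    for f in findings:
        print(f"{f['severity']}: '{f['match']}': {f['suggestion']}")
    ```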

    How it works

    1. Paste your system prompt.
    2. Review the grade (A–F) and findings broken down by rule.
    3. Fix the errors first, then warnings — each has a concrete suggestion.
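    The steps above imply a grading curve over finding counts. The thresholds in this sketch are hypothetical, since the linter's real curve is not documented:

    ```python
    def grade(errors: int, warnings: int, infos: int = 0) -> str:
        """Map finding counts to a letter grade (illustrative thresholds)."""
        if errors > 0:
            return "F"        # any error fails the prompt outright
        if warnings >= 4:
            return "D"
        if warnings >= 2:
            return "C"
        if warnings == 1:
            return "B"
        return "A"

    print(grade(errors=1, warnings=4))  # the sample report above: F
    ```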

    Frequently asked questions

    Will a high score guarantee a better model?

    No — prompts are empirical. This linter catches common footguns, but the final measure is evaluation on your actual task. Use it as a pre-flight check, not a substitute for evals.

    Are the rules opinionated?

    Yes — they reflect 2024-2026 best practice from Anthropic's Claude prompting guide, OpenAI's GPT-4 system message docs, and widely-reported patterns from production prompt engineering teams.

    Why does it flag 'helpful, harmless, and honest'?

    That phrase is already heavily embedded in RLHF training. Repeating it rarely changes behavior — spend those tokens on domain-specific rules the model doesn't already know.

    How do XML tags help?

    Claude was trained on XML-structured examples and reliably respects <instructions>, <context>, <examples>, and <output_format> sections. GPT-4 benefits modestly; Gemini benefits less, though the tags do no harm.
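    A sketch of what such a tagged prompt might look like; the section names follow the conventions above, but the content is purely illustrative:

    ```xml
    <instructions>
    You are a TypeScript refactoring assistant. Only edit code inside the user's selection.
    </instructions>

    <context>
    The user works in a strict-mode TypeScript 5 codebase with ESLint enabled.
    </context>

    <output_format>
    Reply with a single fenced code block, then a one-sentence summary of the change.
    </output_format>
    ```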

    Is my prompt sent to a server?

    No — all linting runs in your browser.