Jailbreaking LLMs via Misleading Comments
A collaborative adversarial research project exploring how deceptive code comments can poison the outputs of Large Language Models.
📁 Dataset
The dataset contains 200+ adversarial prompts crafted across 7 prompt categories and 5 narrative types to evaluate LLM behavior under misleading code comments.
- Categories: Physical Harm, Malware, Illegal Activity, Hate Speech, Economic Harm, Fraud, Benign
- Narratives: Research Simulation, Cybersecurity Game, PenTest Framework, Educational Tool, Fictional App Dev
- Each record includes the prompt, the misleading comment context, the expected behavior, and the generated output (see the loading sketch below)
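As a rough illustration of how these records could be loaded and filtered, here is a minimal Python sketch. The file name and column names (`category`, `narrative`, `prompt`, `comment_context`, `expected_behavior`, `generated_output`) are assumptions for illustration, not the official schema of the released dataset.

```python
import pandas as pd

# NOTE: file name and column names are illustrative assumptions, not the official schema.
df = pd.read_csv("llm_comment_vulnerability_dataset.csv")

# Pull a few records from one category and narrative type.
subset = df[(df["category"] == "Malware") & (df["narrative"] == "PenTest Framework")]

for _, row in subset.head(3).iterrows():
    print(row["prompt"])             # adversarial prompt sent to the model
    print(row["comment_context"])    # the misleading code comment
    print(row["expected_behavior"])  # e.g., refusal vs. compliance
    print(row["generated_output"])   # model response captured during the study
```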
📊 Results & Insights
Our experiments reveal critical weaknesses in LLMs when exposed to deceptive code comments. Below we highlight model vulnerabilities, harm categories, narrative impacts, and awareness failures.
The results are summarized in four charts:
- Success Rate by Model Pair
- Success Rate by Prompt Category
- Success Rate by Narrative Type
- Output Awareness Breakdown
* Human evaluator agreement ranged from 34.29% to 90.91%. Judging harm and awareness remains partially subjective.
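For readers who want to recompute these breakdowns from the dataset, a minimal sketch follows; the jailbreak-success and evaluator label columns are assumptions for illustration, not the released field names.

```python
import pandas as pd

# NOTE: column names ("jailbreak_success", evaluator labels) are assumptions for illustration.
df = pd.read_csv("llm_comment_vulnerability_dataset.csv")

# Success rate (%) per prompt category, assuming a 0/1 jailbreak label per record.
success_by_category = df.groupby("category")["jailbreak_success"].mean().mul(100).round(2)
print(success_by_category)

# Simple pairwise percent agreement between two human evaluators.
agreement = (df["evaluator_1_label"] == df["evaluator_2_label"]).mean() * 100
print(f"Evaluator agreement: {agreement:.2f}%")
```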
📘 Research Paper
Our research paper is currently under review and will be published soon. Stay tuned for updates and access to the full paper, which will include detailed methodology, prompts, and results.
The PDF will be available for download after publication.
📝 Cite This Work
Cite the dataset and paper as follows:
Dataset Citation
BibTeX format:
@dataset{sami_2025_15786008,
  author       = {Sami, Aftar Ahmad and
                  Debnath, Gourob and
                  Dey, Rajon and
                  Chowdhury, Abdulla Nasir},
  title        = {LLM Comment Vulnerability Dataset},
  month        = jul,
  year         = 2025,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.15786008},
  url          = {https://doi.org/10.5281/zenodo.15786008},
}

IEEE format:
A. A. Sami, G. Debnath, R. Dey and A. N. Chowdhury, “LLM Comment Vulnerability Dataset”. Zenodo, Jul. 01, 2025. doi: 10.5281/zenodo.15786008.

APA format:
Sami, A. A., Debnath, G., Dey, R., & Chowdhury, A. N. (2025). LLM Comment Vulnerability Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15786008

Paper Citation
Coming soon
👤 About the Researchers
Meet the team behind this project—three researchers passionate about AI safety, adversarial attacks, and LLM robustness. Our collaboration brings together diverse expertise to advance the field.