Jailbreaking LLMs via Misleading Comments
A peer-reviewed adversarial study showing how deceptive code comments can poison the outputs of Large Language Models. Presented at the 28th International Conference on Computer and Information Technology (ICCIT), Cox's Bazar, Bangladesh.
📁 Dataset
The dataset features 200+ adversarial prompts crafted across 7 harm categories and 5 narrative types to evaluate LLM behavior under misleading code comments.
- Categories: Physical Harm, Malware, Illegal Activity, Hate Speech, Economic Harm, Fraud, Benign
- Narratives: Research Simulation, Cybersecurity Game, PenTest Framework, Educational Tool, Fictional App Dev
- Includes prompt, comment context, expected behavior, and generated output
📊Results & Insights
Our experiments reveal critical weaknesses in LLMs when exposed to deceptive code comments. Below we highlight model vulnerabilities, harm categories, narrative impacts, and awareness failures.
Success Rate by Model Pair
Success Rate by Prompt Category
Success Rate by Narrative Type
Output Awareness Breakdown
* Human evaluator agreement ranged from 34.29% to 90.91%. Judging harm and awareness remains partially subjective.
📘 Research Paper
Code Poisoning Through Misleading Comments: Jailbreaking Large Language Models via Contextual Deception
A. A. Sami, G. Debnath, R. Dey, and A. N. Chowdhury
Abstract
Large language models (LLMs) increasingly underpin everyday software tooling; yet, their deference to surrounding comments creates a subtle yet potent attack surface. We demonstrate a new jail-break technique that hides prohibited requests inside ostensibly educational or maintenance comments, inducing models to emit disallowed code and instructions despite alignment safeguards. To quantify the risk, we assemble a 200-prompt benchmark covering seven harm categories (e.g., physical, economic, malware) and five narrative frames (e.g., research simulation, penetration testing), each expressed as short Python snippets with carefully crafted deceptive annotations. A 3×3 factorial study probes three state-of-the-art LLMs — Gemini-2.0-Flash, DeepSeek-R1-Distill-LLaMA-70B, and LLaMA-3.3-70B-Versatile — evaluating 1,800 generations with automated rules plus expert adjudication for both “harmfulness” and “harm awareness.” Attacks succeed in 63%–93.5% of cases; LLaMA-3.3-70B-Versatile proves most susceptible, while malware and illegal activity prompts achieve near-perfect bypass rates. More than half of successful outputs are produced “harmful & unaware,” indicating that current safety layers frequently fail silently rather than refuse. The results call for comment-level auditing whenever LLMs are deployed in production code workflows.
Author Keywords
📝 Cite This Work
If you use this work, please cite both the paper and the dataset:
Paper Citation
BibTeX format:
@INPROCEEDINGS{11491067,
author = {Sami, Aftar Ahmad and Debnath, Gourob and Dey, Rajon and Chowdhury, Abdulla Nasir},
booktitle = {2025 28th International Conference on Computer and Information Technology (ICCIT)},
title = {Code Poisoning Through Misleading Comments: Jailbreaking Large Language Models via Contextual Deception},
year = {2025},
pages = {3812-3817},
doi = {10.1109/ICCIT68739.2025.11491067},
publisher = {IEEE},
address = {Cox's Bazar, Bangladesh},
keywords = {Adversarial attack; Code poisoning; Large language models (LLMs); Jailbreaking; Contextual deception; Code comments; LLM vulnerabilities; AI safety}
}IEEE format:
A. A. Sami, G. Debnath, R. Dey and A. N. Chowdhury, “Code Poisoning Through Misleading Comments: Jailbreaking Large Language Models via Contextual Deception,” 2025 28th International Conference on Computer and Information Technology (ICCIT), Cox’s Bazar, Bangladesh, 2025, pp. 3812–3817, doi: 10.1109/ICCIT68739.2025.11491067.APA format:
Sami, A. A., Debnath, G., Dey, R., & Chowdhury, A. N. (2025). Code Poisoning Through Misleading Comments: Jailbreaking Large Language Models via Contextual Deception. In 2025 28th International Conference on Computer and Information Technology (ICCIT) (pp. 3812–3817). IEEE. https://doi.org/10.1109/ICCIT68739.2025.11491067Dataset Citation
BibTeX format:
@dataset{sami_2025_15786008,
author = {Sami, Aftar Ahmad and Debnath, Gourob and Dey, Rajon and Chowdhury, Abdulla Nasir},
title = {LLM Comment Vulnerability Dataset},
month = jul,
year = 2025,
publisher = {Zenodo},
doi = {10.5281/zenodo.15786008},
url = {https://doi.org/10.5281/zenodo.15786008}
}IEEE format:
A. A. Sami, G. Debnath, R. Dey and A. N. Chowdhury, “LLM Comment Vulnerability Dataset”. Zenodo, Jul. 01, 2025. doi: 10.5281/zenodo.15786008.APA format:
Sami, A. A., Debnath, G., Dey, R., & Chowdhury, A. N. (2025). LLM Comment Vulnerability Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.15786008👤About the Researchers
Meet the team behind this project—four researchers passionate about AI safety, adversarial attacks, and LLM robustness. Our collaboration brings together diverse expertise to advance the field.
Aftar Ahmad Sami
AI Analyst & Data Researcher
Focus: Data analysis, prompt engineering
Dept. of Computer Science and Engineering, Leading University, Sylhet, Bangladesh
Gourob Debnath
Software Engineer
Focus: Backend, security, and automation
EARL Research Lab, Bangladesh
Rajon Dey
Software Engineer
Focus: Project direction, architecture,
and full-stack development
EARL Research Lab, Bangladesh