Filtering ancient Drupal comments for spam using an LLM
I have been meaning to resurrect my blog for a long time. I’ve written many versions of this post, whether on dev.to, wordpress.org, Obsidian Publish, etc. etc. etc. I have spent hundreds of dollars on the question “how do blog?!” and I’m back to “Pay $5/mo for a VPS, put HTML files on the internet.”
I have an old Drupal 5 blog from ~2014, into which I migrated my ~2010 Drupal 4 blog posts. I really liked the 2014-era content from when I first started experimenting with software-defined radio, and I had some moderately well-commented posts on GQRX SDR and Linux gaming at the time.
Aaaaanyway. I have ~19k comments in my Drupal 5 blog and only about 20 that I remember being relevant or interesting and worth preserving.
TLDR: Check my GitHub repo
I have created a repo that converts my ancient MySQL database dumps containing all the content of my Drupal 5 blog. This is a work in progress:
Restoring 2010s Drupal 5 blog to modern tech stack
- Converting the Drupal 5 blog posts to Markdown to use with my Hugo blog
- Filtering the comments using Ollama to find the very small number of non-spam comments
Local-only
I strongly prefer to run a local LLM when possible, and Ollama makes this shockingly easy. For this article I will stick to local-only models (see the pull sketch after this list):
- Google’s new Gemma3 (Just found out about this, let’s try it)
- Llama 3.2 (Actually, I’ve been using the vision model for text stuff, but this exists and I’ll give it a try)
- Deepseek R1
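Pulling these ahead of time is a one-liner per model with the Ollama Python client. A minimal sketch, assuming the ollama package is installed and the Ollama daemon is running (the tags below match the model names that show up in the performance summaries later):
import ollama
# Pull each model used in this post. The deepseek-r1 7B tag is the Qwen
# distill and the 8B tag is the Llama distill.
for tag in ("gemma3", "llama3.2", "deepseek-r1:7b", "deepseek-r1:8b"):
    ollama.pull(tag)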
Example spam comments
Here are some obvious spam comments. Yes, there’s swearing and such. Rating: PG-16
*************************** 8. row ***************************
comment_subject: Good evening! you look: - DVD
comment_username: aperoiree
comment_user_email: [email protected]
comment_user_homepage_url: http://gay.eurovids.us/
comment_content: Good evening! you look: - DVD <a href=http://gay.eurovids.us/>ass and dick gay </a> - gay men in suits or DVD <a href=http://gay.eurovids.us/>gay clint taylor </a> - working stiff dvd gay , <a href=http://gay.eurovids.us/>free gay video sex clips </a> - gay fuck legs up tgp ...!!! Let's keep in touch
*************************** 9. row ***************************
comment_subject: It's hard to find experienced
comment_username: bridesmaid shoes store
comment_user_email: [email protected]
comment_user_homepage_url: http://www.sederhana.jp/user/view/profile/login/thfdoming
comment_content: It's hard to find experienced people in this particular topic, however, you sound like you know what you'гe talking аbout!
Thanκs
*************************** 10. row ***************************
comment_subject: Rеmагκable thіngѕ heгe.
comment_username: bridal Shoes Shop
comment_user_email: [email protected]
comment_user_homepage_url: http://www.compilots.com/topsites/index.php?a=stats&u=sonjapitcher
comment_content: Rеmагκable thіngѕ heгe.
I’m very glad to look your post. Thanks a lot and I am taking a look forward to contact you. Will you please drop me a e-mail?
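For context, those rows are vertical-format (\G) MySQL output. A query along these lines pulls the same fields from a stock Drupal 5 comments table (the column aliases and connection details here are placeholders of my own, chosen to match the labels above):
import pymysql
# Assumes the stock Drupal 5 schema: cid, subject, name, mail, homepage, comment.
conn = pymysql.connect(host="localhost", user="drupal", password="changeme",
                       database="drupal5", cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("""
        SELECT subject  AS comment_subject,
               name     AS comment_username,
               mail     AS comment_user_email,
               homepage AS comment_user_homepage_url,
               comment  AS comment_content
        FROM comments
        ORDER BY cid
    """)
    comments = cur.fetchall()  # list of dicts, one per comment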
LLM Prompt for spam identification
I asked Copilot to generate a prompt to help with the spam detection. It was pretty good off the bat and works well enough. Perfect is the enemy of good, so I’ll keep something that is working.
You are evaluating blog comments to detect spam. Analyze the following comment
and determine if it's spam.
Comment details:
- Subject: {}
- Username: {}
- Email: {}
- Homepage URL: {}
- Content: {}
Look for these spam indicators:
1. Excessive or irrelevant links
2. Generic, unrelated content
3. Promotional language unrelated to the post
4. Suspicious URLs or email patterns
5. Mismatched username/email combinations
6. Nonsensical text or keyword stuffing
Respond with ONLY ONE of these two statements:
- "SPAM" if you determine this is likely spam
- "NOT_SPAM" if you believe this is a legitimate comment
Your analysis:
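Per comment, the filled-in prompt goes to the local model and the reply gets bucketed into SPAM, NOT_SPAM, or ERROR. A minimal sketch of that step, assuming the prompt above lives in a PROMPT_TEMPLATE string with {} placeholders and each comment is a dict keyed like the fields shown earlier (names are illustrative, not the exact code in the repo):
import ollama
def classify(model: str, comment: dict) -> str:
    prompt = PROMPT_TEMPLATE.format(
        comment["comment_subject"],
        comment["comment_username"],
        comment["comment_user_email"],
        comment["comment_user_homepage_url"],
        comment["comment_content"],
    )
    reply = ollama.generate(model=model, prompt=prompt)["response"].strip()
    # Anything other than the two expected labels gets counted as ERROR.
    return reply if reply in ("SPAM", "NOT_SPAM") else "ERROR"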
Performance
Google’s Gemma3
Model: Google’s Gemma3 c0494fe00251
Sample of 1900/~19,000 (10%)
=== Performance Summary ===
Model: gemma3
=== Performance Statistics ===
Total comments processed: 1900
Total execution time: 513.862 seconds
Average time per comment: 0.270 seconds
Average content length: 1665.28 characters
Min execution time: 0.092 seconds
Max execution time: 10.471 seconds
Correlation between content length and execution time: 0.127
=== Classification Results ===
SPAM: 1889 (99.4%)
ERROR: 6 (0.3%)
NOT_SPAM: 5 (0.3%)
=== Detailed Results ===
Comment ID Length Execution Time (s) Result
0 1 2292 7.420324 SPAM
1 2 201 0.149500 SPAM
2 3 1866 0.330189 SPAM
3 4 552 0.232348 SPAM
4 5 1450 0.244984 SPAM
... ... ... ... ...
1895 1896 171 0.140129 SPAM
1896 1897 648 0.187439 SPAM
1897 1898 764 0.169726 SPAM
1898 1899 670 0.169026 SPAM
1899 1900 586 0.173447 SPAM
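The summary blocks above are just aggregations over a per-comment results table; roughly something like this, assuming the comments list and classify() helper sketched earlier (column names chosen to match the detailed-results header):
import time
import pandas as pd
rows = []
for i, comment in enumerate(comments, start=1):
    start = time.perf_counter()
    result = classify("gemma3", comment)
    rows.append({
        "Comment ID": i,
        "Length": len(comment["comment_content"]),
        "Execution Time (s)": time.perf_counter() - start,
        "Result": result,
    })
df = pd.DataFrame(rows)
print(df["Result"].value_counts())  # SPAM / NOT_SPAM / ERROR counts
print(df["Execution Time (s)"].sum(), df["Execution Time (s)"].mean())
print(df["Length"].corr(df["Execution Time (s)"]))  # content length vs. execution time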
Llama 3.2 - 3B text-only
Model: Llama 3.2 3B text-only a80c4f17acd5
Llama 3.2 decided to ignore my prompt and output a block of reasoning along with its spam verdict.
I didn’t account for this, so the text from 1,900 model responses filled up my terminal buffer and I lost my console output. Oops!
This took ~15 minutes, seemingly because the model was generating SO MUCH TEXT instead of following instructions.
I added the following instruction to the end of my prompt to keep Llama 3.2 on track:
Do not include your reasoning, only respond with a single value.
Your analysis:
Results:
=== Performance Summary ===
Model: llama3.2
=== Performance Statistics ===
Total comments processed: 1900
Total execution time: 323.217 seconds
Average time per comment: 0.170 seconds
Average content length: 1665.28 characters
Min execution time: 0.055 seconds
Max execution time: 1.115 seconds
Correlation between content length and execution time: 0.679
=== Classification Results ===
SPAM: 1718 (90.4%)
NOT_SPAM: 180 (9.5%)
I can help you analyze the text. However, I don't see any text to analyze in your request. Could you please provide the text that you would like me to analyze for spam indicators?: 1 (0.1%)
I would categorize the provided text as NOT_SPAM. There are no obvious indicators of spam such as excessive links or promotional language that seems unrelated to the post's content. The text appears to be a straightforward advertisement for an Arena FIGHTING app, and the structure of the message is clear and concise.: 1 (0.1%)
=== Detailed Results ===
Comment ID Length Execution Time (s) Result
0 1 2292 0.235777 SPAM
1 2 201 0.099050 NOT_SPAM
2 3 1866 0.238307 SPAM
3 4 552 0.176865 SPAM
4 5 1450 0.177989 SPAM
... ... ... ... ...
1895 1896 171 0.090118 SPAM
1896 1897 648 0.130774 SPAM
1897 1898 764 0.113110 SPAM
1898 1899 670 0.110565 SPAM
1899 1900 586 0.113194 SPAM
Deepseek
I needed to strip the thinking tokens from Deepseek-R1’s output, since it is a reasoning model.
I used an extremely basic and fragile regular expression to do it; you’d probably want something better in production, but this is just for the blog post.
cleaned_response = re.sub(r"<think>.*</think>", "", response, flags=re.DOTALL)
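One caveat: that pattern is greedy, so if a response ever contained more than one <think> block, everything between the first opening tag and the last closing tag would be dropped. A non-greedy match is slightly safer:
cleaned_response = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)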
Deepseek R1 Qwen 7B
Model: deepseek-r1 7b Qwen 0a8c26691023
Deepseek is a thinking model, so it thinks and takes an extensive amount of time per comment. I don’t have time to run 1900 comments through this model, so I’ve done 19.
Deepseek is very permissive toward spam, it seems! A 47% NOT_SPAM rate is pretty high, and I know that this dataset is nearly entirely spam. :thinking emoji:
=== Performance Summary ===
Model: deepseek-r1
=== Performance Statistics ===
Total comments processed: 19
Total execution time: 158.028 seconds
Average time per comment: 8.317 seconds
Average content length: 1105.84 characters
Min execution time: 4.332 seconds
Max execution time: 11.107 seconds
Correlation between content length and execution time: 0.461
=== Classification Results ===
SPAM: 10 (52.6%)
NOT_SPAM: 9 (47.4%)
=== Detailed Results ===
Comment ID Length Execution Time (s) Result
0 1 2292 7.626532 SPAM
1 2 201 4.331504 NOT_SPAM
2 3 1866 9.196588 SPAM
3 4 552 6.144507 SPAM
4 5 1450 8.180156 NOT_SPAM
5 6 1335 9.637984 NOT_SPAM
6 7 305 7.373705 SPAM
7 8 148 7.517100 NOT_SPAM
8 9 195 8.402971 NOT_SPAM
9 10 1753 7.869087 SPAM
10 11 347 8.207781 NOT_SPAM
11 12 2354 7.928737 NOT_SPAM
12 13 1455 9.567515 SPAM
13 14 2065 10.776929 SPAM
14 15 1971 11.106560 SPAM
15 16 403 7.730168 NOT_SPAM
16 17 1790 9.258505 NOT_SPAM
17 18 288 10.659292 SPAM
18 19 241 6.512707 SPAM
Deepseek R1 Llama 8B
Model: Deepseek R1 8B Llama
The distilled Llama model is much less permissive towards spam. Very… interesting!
=== Performance Summary ===
Model: deepseek-r1:8b
=== Performance Statistics ===
Total comments processed: 19
Total execution time: 130.012 seconds
Average time per comment: 6.843 seconds
Average content length: 1105.84 characters
Min execution time: 3.895 seconds
Max execution time: 15.398 seconds
Correlation between content length and execution time: 0.486
=== Classification Results ===
SPAM: 18 (94.7%)
NOT_SPAM: 1 (5.3%)
=== Detailed Results ===
Comment ID Length Execution Time (s) Result
0 1 2292 15.397510 SPAM
1 2 201 5.205121 NOT_SPAM
2 3 1866 8.467639 SPAM
3 4 552 7.522551 SPAM
4 5 1450 6.325638 SPAM
5 6 1335 7.330761 SPAM
6 7 305 3.895096 SPAM
7 8 148 8.660520 SPAM
8 9 195 4.523488 SPAM
9 10 1753 5.040600 SPAM
10 11 347 5.433511 SPAM
11 12 2354 7.559391 SPAM
12 13 1455 6.404013 SPAM
13 14 2065 6.801055 SPAM
14 15 1971 7.327605 SPAM
15 16 403 5.789532 SPAM
16 17 1790 6.085722 SPAM
17 18 288 5.825693 SPAM
18 19 241 6.416993 SPAM
Conclusions
- I should summarize this article with an LLM
- I’ll stick to a small model like Gemma3 for the sake of execution time