I have been meaning to resurrect my blog for a long time. I’ve written many versions of this post, whether on dev.to, wordpress.org, Obsidian Publish, etc. etc. etc. I have spent hundreds of dollars on the question “how do blog?!” and I’m back to “Pay $5/mo for a VPS, put HTML files on the internet.”

I have an old Drupal 5 blog from ~2014, into which I migrated my ~2010 Drupal 4 blog posts. I really liked the 2014-era content from when I first started experimenting with software-defined radio, and I had some moderately well-commented posts on GQRX SDR and Linux gaming at the time.

Aaaaanyway. I have 19k comments in my Drupal 5 blog, and only about 20 that I remember being relevant or interesting enough to be worth preserving.

TLDR: Check my GitHub repo

I have created a repo to convert my ancient MySQL database dumps containing all the content of my Drupal 5 blog. This is a work in progress.

Restoring 2010s Drupal 5 blog to modern tech stack

  • Converting the Drupal 5 blog posts to Markdown to use with my Hugo blog
  • Filtering the comments using Ollama to find the very small number of non-spam comments

Local-only

I strongly prefer to run a local LLM when possible. Ollama makes this shockingly easy. For this article I will stick to local-only models.

Example spam comments

Here are some obvious spam comments. Yes, there’s swearing and such. Rating: PG-16

*************************** 8. row ***************************
          comment_subject: Good evening! you look: - DVD
         comment_username: aperoiree
       comment_user_email: [email protected]
comment_user_homepage_url: http://gay.eurovids.us/
          comment_content: Good evening! you look: - DVD <a href=http://gay.eurovids.us/>ass and dick gay </a> - gay men in suits   or DVD <a href=http://gay.eurovids.us/>gay clint taylor </a> - working stiff dvd gay  , <a href=http://gay.eurovids.us/>free gay video sex clips </a> - gay fuck legs up tgp  ...!!! Let's keep in touch
*************************** 9. row ***************************
          comment_subject: It's hard to find experienced
         comment_username: bridesmaid shoes store
       comment_user_email: [email protected]
comment_user_homepage_url: http://www.sederhana.jp/user/view/profile/login/thfdoming
          comment_content: It's hard to find experienced people in this particular topic, however, you sound like you know what you'&#1075;e talking &#1072;bout!
Than&kappa;s
*************************** 10. row ***************************
          comment_subject: Rеmагκable thіngѕ heгe.
         comment_username: bridal Shoes Shop
       comment_user_email: [email protected]
comment_user_homepage_url: http://www.compilots.com/topsites/index.php?a=stats&u=sonjapitcher
          comment_content: R&#1077;m&#1072;&#1075;&kappa;able th&#1110;ng&#1109; he&#1075;e.

I’m very glad to look your post. Thanks a lot and I am taking a look forward to contact you. Will you please drop me a e-mail?

LLM Prompt for spam identification

I asked Copilot to generate a prompt to help with the spam detection. It was pretty good off the bat and works well enough. Perfect is the enemy of good, so I’ll keep something that is working.

You are evaluating blog comments to detect spam. Analyze the following comment
and determine if it's spam.

Comment details:
- Subject: {}
- Username: {}
- Email: {}
- Homepage URL: {}
- Content: {}

Look for these spam indicators:
1. Excessive or irrelevant links
2. Generic, unrelated content
3. Promotional language unrelated to the post
4. Suspicious URLs or email patterns
5. Mismatched username/email combinations
6. Nonsensical text or keyword stuffing

Respond with ONLY ONE of these two statements:
- "SPAM" if you determine this is likely spam
- "NOT_SPAM" if you believe this is a legitimate comment

Your analysis:
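
As a rough sketch of how a prompt like this can be filled in and sent to a local model: the snippet below formats a template with the five comment fields and posts it to Ollama's `/api/generate` endpoint. It is not the code from my repo. The template is an abbreviated stand-in for the full prompt, and the server address, the `gemma3` default, and the function names are assumptions for illustration.

```python
import json
import urllib.request

# Assumed address of a local Ollama server; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Abbreviated stand-in for the full prompt above. The five {} slots are
# filled in order: subject, username, email, homepage URL, content.
PROMPT_TEMPLATE = (
    "You are evaluating blog comments to detect spam.\n"
    "- Subject: {}\n"
    "- Username: {}\n"
    "- Email: {}\n"
    "- Homepage URL: {}\n"
    "- Content: {}\n"
    'Respond with ONLY "SPAM" or "NOT_SPAM".\n\n'
    "Your analysis:"
)


def build_prompt(comment: dict) -> str:
    """Fill the template from one comment row (a dict keyed by column name)."""
    return PROMPT_TEMPLATE.format(
        comment.get("comment_subject", ""),
        comment.get("comment_username", ""),
        comment.get("comment_user_email", ""),
        comment.get("comment_user_homepage_url", ""),
        comment.get("comment_content", ""),
    )


def classify(comment: dict, model: str = "gemma3") -> str:
    """Send one comment to the local model and return its raw text reply."""
    payload = json.dumps(
        {"model": model, "prompt": build_prompt(comment), "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.loads(reply.read())["response"].strip()
```

Calling `classify(row)` needs a running Ollama server with the model already pulled; `build_prompt(row)` works on its own if you just want to eyeball what the model will see.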

Performance

Google’s Gemma3

Model: Google’s Gemma3 c0494fe00251

Sample of 1900/~19,000 (10%)

=== Performance Summary ===
Model: gemma3

=== Performance Statistics ===
Total comments processed: 1900
Total execution time: 513.862 seconds
Average time per comment: 0.270 seconds
Average content length: 1665.28 characters
Min execution time: 0.092 seconds
Max execution time: 10.471 seconds

Correlation between content length and execution time: 0.127

=== Classification Results ===
SPAM: 1889 (99.4%)
ERROR: 6 (0.3%)
NOT_SPAM: 5 (0.3%)

=== Detailed Results ===
      Comment ID  Length  Execution Time (s) Result
0              1    2292            7.420324   SPAM
1              2     201            0.149500   SPAM
2              3    1866            0.330189   SPAM
3              4     552            0.232348   SPAM
4              5    1450            0.244984   SPAM
...          ...     ...                 ...    ...
1895        1896     171            0.140129   SPAM
1896        1897     648            0.187439   SPAM
1897        1898     764            0.169726   SPAM
1898        1899     670            0.169026   SPAM
1899        1900     586            0.173447   SPAM
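
For reference, the timing figures and the length/time correlation in a summary like this can be computed with nothing but the standard library. This is a minimal sketch, not the script that produced the output above; the function names are made up for illustration, and the sample data is just the first five rows of the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # undefined when one series is constant
    return covariance / math.sqrt(spread_x * spread_y)


def summarize(lengths, seconds):
    """Collect the headline numbers for one model run."""
    return {
        "processed": len(seconds),
        "total_seconds": sum(seconds),
        "average_seconds": sum(seconds) / len(seconds),
        "average_length": sum(lengths) / len(lengths),
        "min_seconds": min(seconds),
        "max_seconds": max(seconds),
        "length_time_correlation": pearson(lengths, seconds),
    }


# First five rows of the Gemma3 table above, used only as sample input.
sample_lengths = [2292, 201, 1866, 552, 1450]
sample_seconds = [7.420324, 0.149500, 0.330189, 0.232348, 0.244984]
for name, value in summarize(sample_lengths, sample_seconds).items():
    print(f"{name}: {value}")
```

A correlation near 0 means response time barely depends on comment length, while a value closer to 1 means longer comments reliably take longer.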

Llama 3.2 - 3B text-only

Model: Llama 3.2 3B text-only a80c4f17acd5

Llama 3.2 decided to ignore my prompt and output a block of reasoning along with its spam verdict.

I didn’t account for this, so I ended up losing my console output to the text from 1900 responses from the model filling up my terminal buffer. Oops!

This took ~15 minutes, seemingly because the model was generating SO MUCH TEXT instead of following instructions.

I added the following direction to the end of my prompt to help steer Llama 3.2:

Do not include your reasoning, only respond with a single value.

Your analysis:

Results:

=== Performance Summary ===
Model: llama3.2

=== Performance Statistics ===
Total comments processed: 1900
Total execution time: 323.217 seconds
Average time per comment: 0.170 seconds
Average content length: 1665.28 characters
Min execution time: 0.055 seconds
Max execution time: 1.115 seconds

Correlation between content length and execution time: 0.679

=== Classification Results ===
SPAM: 1718 (90.4%)
NOT_SPAM: 180 (9.5%)
I can help you analyze the text. However, I don't see any text to analyze in your request. Could you please provide the text that you would like me to analyze for spam indicators?: 1 (0.1%)
I would categorize the provided text as NOT_SPAM. There are no obvious indicators of spam such as excessive links or promotional language that seems unrelated to the post's content. The text appears to be a straightforward advertisement for an Arena FIGHTING app, and the structure of the message is clear and concise.: 1 (0.1%)

=== Detailed Results ===
      Comment ID  Length  Execution Time (s)    Result
0              1    2292            0.235777      SPAM
1              2     201            0.099050  NOT_SPAM
2              3    1866            0.238307      SPAM
3              4     552            0.176865      SPAM
4              5    1450            0.177989      SPAM
...          ...     ...                 ...       ...
1895        1896     171            0.090118      SPAM
1896        1897     648            0.130774      SPAM
1897        1898     764            0.113110      SPAM
1898        1899     670            0.110565      SPAM
1899        1900     586            0.113194      SPAM
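
The two chatty replies in the classification results show why it helps to normalize the model's text before counting it. Here is one possible sketch: accept a bare label, otherwise look for exactly one upper-case label in the text, and fall back to a catch-all. The `ERROR` fallback and the function name are my assumptions for illustration, not necessarily how my script buckets things.

```python
import re
from collections import Counter

# Matches the two labels as whole tokens. Case-sensitive on purpose, so the
# ordinary word "spam" in a sentence is not mistaken for the SPAM label.
LABEL_PATTERN = re.compile(r"\b(NOT_SPAM|SPAM)\b")


def normalize_verdict(response: str) -> str:
    """Map raw model text to SPAM, NOT_SPAM, or ERROR."""
    # Case 1: the model answered with just the label, maybe quoted.
    bare = response.strip().strip("\"'.").upper()
    if bare in ("SPAM", "NOT_SPAM"):
        return bare
    # Case 2: the label is buried in prose; accept it only if unambiguous.
    labels = set(LABEL_PATTERN.findall(response))
    if len(labels) == 1:
        return labels.pop()
    # Case 3: no label, or both labels mentioned.
    return "ERROR"


replies = [
    "SPAM",
    "I would categorize the provided text as NOT_SPAM. There are no "
    "obvious indicators of spam such as excessive links.",
    "I can help you analyze the text. However, I don't see any text to "
    "analyze in your request.",
]
print(Counter(normalize_verdict(reply) for reply in replies))
```

With this in place, a reply like the "Arena FIGHTING app" one above gets counted under NOT_SPAM instead of showing up as its own category.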

Deepseek

I needed to strip the thinking block from Deepseek-R1’s responses, as it is a thinking model.

I used an extremely basic and fragile regular expression to strip this. You should probably do better in production, but I’m doing this just for the blog post.

cleaned_response = re.sub(r"<think>.*</think>", "", response, flags=re.DOTALL)
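
A slightly sturdier variant, as a sketch only: the one-liner above is greedy, so if a response ever contained two thinking blocks it would also swallow everything between them, and it leaves the text alone if the closing tag is missing (for example, when generation gets cut off). The version below handles both cases. The function name is hypothetical, and I am assuming the tags always appear literally as `<think>` and `</think>`.

```python
import re

# Non-greedy: each <think>...</think> pair is removed on its own.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# An opening tag with no closing tag: drop everything from it to the end.
DANGLING_THINK = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_thinking(response: str) -> str:
    """Remove thinking blocks and return only the model's final answer."""
    without_blocks = THINK_BLOCK.sub("", response)
    without_dangling = DANGLING_THINK.sub("", without_blocks)
    return without_dangling.strip()


print(strip_thinking("<think>Lots of links...\nprobably spam.</think>\n\nSPAM"))
```

This is still regex-on-markup, so it only holds up as long as the model emits the tags exactly as expected.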

Deepseek R1 Qwen 7B

Model: deepseek-r1 7b Qwen 0a8c26691023

Deepseek is a thinking model, so it thinks and takes an extensive amount of time. I don’t have time to run 1900 comments through this model, so I’ve done 19.

Deepseek is very permissive toward spam, it seems! Marking 47% of the sample as NOT_SPAM is pretty high, given that I know this dataset is nearly entirely spam. :thinking emoji:

=== Performance Summary ===
Model: deepseek-r1

=== Performance Statistics ===
Total comments processed: 19
Total execution time: 158.028 seconds
Average time per comment: 8.317 seconds
Average content length: 1105.84 characters
Min execution time: 4.332 seconds
Max execution time: 11.107 seconds

Correlation between content length and execution time: 0.461

=== Classification Results ===
SPAM: 10 (52.6%)
NOT_SPAM: 9 (47.4%)

=== Detailed Results ===
    Comment ID  Length  Execution Time (s)    Result
0            1    2292            7.626532      SPAM
1            2     201            4.331504  NOT_SPAM
2            3    1866            9.196588      SPAM
3            4     552            6.144507      SPAM
4            5    1450            8.180156  NOT_SPAM
5            6    1335            9.637984  NOT_SPAM
6            7     305            7.373705      SPAM
7            8     148            7.517100  NOT_SPAM
8            9     195            8.402971  NOT_SPAM
9           10    1753            7.869087      SPAM
10          11     347            8.207781  NOT_SPAM
11          12    2354            7.928737  NOT_SPAM
12          13    1455            9.567515      SPAM
13          14    2065           10.776929      SPAM
14          15    1971           11.106560      SPAM
15          16     403            7.730168  NOT_SPAM
16          17    1790            9.258505  NOT_SPAM
17          18     288           10.659292      SPAM
18          19     241            6.512707      SPAM

Deepseek R1 Llama 8B

Model: Deepseek R1 8B Llama

The distilled Llama model is much less permissive towards spam. Very… interesting!

=== Performance Summary ===
Model: deepseek-r1:8b

=== Performance Statistics ===
Total comments processed: 19
Total execution time: 130.012 seconds
Average time per comment: 6.843 seconds
Average content length: 1105.84 characters
Min execution time: 3.895 seconds
Max execution time: 15.398 seconds

Correlation between content length and execution time: 0.486

=== Classification Results ===
SPAM: 18 (94.7%)
NOT_SPAM: 1 (5.3%)

=== Detailed Results ===
    Comment ID  Length  Execution Time (s)    Result
0            1    2292           15.397510      SPAM
1            2     201            5.205121  NOT_SPAM
2            3    1866            8.467639      SPAM
3            4     552            7.522551      SPAM
4            5    1450            6.325638      SPAM
5            6    1335            7.330761      SPAM
6            7     305            3.895096      SPAM
7            8     148            8.660520      SPAM
8            9     195            4.523488      SPAM
9           10    1753            5.040600      SPAM
10          11     347            5.433511      SPAM
11          12    2354            7.559391      SPAM
12          13    1455            6.404013      SPAM
13          14    2065            6.801055      SPAM
14          15    1971            7.327605      SPAM
15          16     403            5.789532      SPAM
16          17    1790            6.085722      SPAM
17          18     288            5.825693      SPAM
18          19     241            6.416993      SPAM

Conclusions

  • I should summarize this article with an LLM
  • I’ll stick to a small model like Gemma3 for execution time