dallas_speaks
. transcribed from email convo: 061024
...
u24 Hello,

blather has been spammed* and some of us (I don't know how many yet) are thinking about ways the spam could be stopped, but they all involve code changes. I know you're both busy people so my question is this: if we code it and can attain a high level of community consensus on the code changes, would you be willing to put it in place? (the exact form of the code changes needed/wanted has yet to be decided).

The debate is going on at http://blather.newdream.net/c/can_something_be_done_about_these.html
I'm sure we'd all love one or both of you to join in.
Please help us out.

many thanks,
-u24

* and by spam I mean stuff that is universally agreed to be spam, eg viagra links etc. For examples, please see 'guestbook', 'unknown', 'miss' and others, some of which are linked from 'blather_spam'.
061024
...
dallas Hi u24,

Somebody emailed us about this a few days ago but I hadn't had a chance to take a look yet. Blather right now has no concept of session or authentication, so anything along those lines would have to start from scratch. I think the only places that would need updating are the ?add command (as in blather?add) and addform.

The logic could probably go something like this...

addform page checks for session cookie
  if not found, it includes a captcha question
  if found, carry on like normal
add command checks for session cookie
  if not found, send user back to addform page
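
A rough Perl/CGI sketch of that flow (the cookie name, the 'cmd' parameter, and the captcha question are all made up here, not taken from the real blather source):

#!/usr/bin/perl
# sketch only: all names are hypothetical
use strict;
use warnings;
use CGI qw(:standard);

my $has_session = defined cookie('blather_ok');
my $cmd = param('cmd') || 'addform';

if ($cmd eq 'add') {
    if ($has_session or lc(param('captcha') || '') eq 'blather') {
        # verified now or previously: set a long-lived cookie and save
        my $c = cookie(-name => 'blather_ok', -value => 1, -expires => '+10y');
        print header(-cookie => $c), 'entry saved (stub)';
    }
    else {
        # no cookie and no correct answer: back to the addform page
        print redirect('blather?cmd=addform');
    }
}
else {
    # addform page: include the captcha question only when no cookie is found
    print header(), start_form(-action => 'blather'),
        hidden(-name => 'cmd', -value => 'add', -override => 1);
    print p(q{Type the word 'blather':}), textfield('captcha')
        unless $has_session;
    print textarea(-name => 'says'), submit('post'), end_form();
}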

Would something like that interfere with your blight project? Does your PHP post directly to "blather?add"? Does it seem like that would do the trick? The 'session' could probably last forever so each legitimate user would only have to verify themselves once. That would still probably block most or all spam.

Have you looked at the blather perl source code? If someone proposes a code change I can stick it in.
061024
...
u24 Hi,

thanks for responding so quickly.
yeah, I've looked at the code, but my perl is very limited. anne-girl has hacked it more than I.
Don't worry about blight: no-one uses it, and I'm sure I can relay the captcha from blather to blight and back if needed.

Yeah, that's a good idea re: adding a captcha to ?add (much better than my idea of blocking all access to those that hadn't passed a captcha test that session).

If perl sessions are anything like PHP sessions, then it'd just be a case of adding a session_start(); call (or equivalent) to each page, though quite how this'd work for the .html pages I'm unsure. maybe it wouldn't need to cover them. maybe passing the captcha could be stored as an IP whitelist rather than in a session var.
To clarify: to avoid having to use sessions, once the user has passed a captcha, their IP is stored for X amount of time in an "ok to post entries" list.
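
A sketch of that IP-whitelist alternative, assuming a flat file of "address timestamp" pairs (the path is a placeholder, and the one-day window stands in for "X amount of time"):

use strict;
use warnings;

my $whitelist = '/home/blather/ok_ips';   # hypothetical path
my $ttl       = 24 * 60 * 60;             # "X amount of time": a day here

# call this after a passed captcha
sub whitelist_ip {
    my ($ip) = @_;
    open my $fh, '>>', $whitelist or die "can't append: $!";
    print {$fh} "$ip ", time(), "\n";
    close $fh;
}

# call this from the add command before accepting a post
sub ip_ok {
    my ($ip) = @_;
    open my $fh, '<', $whitelist or return 0;
    while (my $line = <$fh>) {
        my ($addr, $when) = split ' ', $line;
        return 1 if $addr eq $ip and time() - $when < $ttl;
    }
    return 0;
}

So the add command would call ip_ok($ENV{REMOTE_ADDR}) and only fall back to a captcha when it returns false.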

Not sure if sessions/IP whitelists lasting forever is ideal, but that's probably because I'm used to having to worry about server management, and the volume of data that infinite session timeouts would involve scares me - then again, with something as monstrously huge as blather, I'm sure a few thousand session files really wouldn't matter.

I know of a few perl captchas, but TBH something incredibly weak such as "enter the word 'blather' here:" would probably suffice; I'm pretty sure blather is being hit by bots rather than being specifically targeted, and I'm also sure that the bots aren't intelligent enough to circumvent even the simplest captchas (low-hanging fruit and all that).
061024
...
u24 PS: do you mind if I copy your responses to can_something_be_done_about_these ? 061024
...
u24 something also probably needs doing to remove the existing spam links. I can compile a list of affected pages, but it'd probably be quicker for you to SELECT * FROM (relevant table) WHERE says LIKE "%http%", send me the resultant data and I can munch through it (or you can if boring repetitive tasks turn you on...) 061024
...
dallas The html pages don't need to care about the authentication because they're just html (a whole lot of it). We don't care if bots browse... only if they try to post.

Perl doesn't have a built-in session system like PHP so it'd just be a matter of setting a cookie and looking for it later. There'd be no need to save anything server-side (cookies are stored client-side by the browser). We only need to know that someone has jumped through the captcha hoop at some point in the past.


Just making people type in 'blather' is pretty close to just putting in a hidden variable in the addform that the add command looks for. That's an even quicker trick that stops spammers for a little while. When I started having a problem with spam on my blog, I did that for a while, but they would catch on before too long and I had to keep changing the hidden word. That might not be an issue on blather, though.
061024
...
dallas That's fine if anyone cares to see the discussion! 061024
...
u24 "The html pages don't need to care about the authentication because they're just html (a whole lot of it). We don't care if bots browse... only if they try to post."

true.
exactly how much space and bandwidth does blather take up, btw? If it ever gets to silly amounts, re-writing all the pages so they use relative links would reduce space (someone else's idea, not mine)...

"Perl doesn't have a built-in session system like PHP so it'd just be a matter of setting a cookie and looking for it later. There'd be no need to save anything server-side (cookies are stored client-side by the browser). We only need to know that someone has jumped through the captcha hoop at some point in the past."

having the cookie stored client-side will make it spoofable. maybe spambots aren't clever enough though.
With PHP, if you break the session-ed connection (eg by loading a flat html page), the (server-side) session will sometimes be destroyed, which is one of the reasons I thought we'd need to session-ify all the html pages.
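
(For illustration of how spoofable a client-side cookie is: any scriptable HTTP client can simply claim to have it. A Perl/LWP sketch, with the url, parameters, and cookie name all invented:)

use LWP::UserAgent;

# posts with the 'verified' cookie without ever passing the captcha
my $ua  = LWP::UserAgent->new;
my $res = $ua->post(
    'http://blather.newdream.net/blather',
    { cmd => 'add', says => 'anything' },
    Cookie => 'blather_ok=1',
);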


"Just making people type in 'blather' is pretty close to just putting in a hidden variable in the addform that the add command looks for. That's an even quicker trick that stops spammers for a little while."

I think the bots are actually parsing the form data and filling it automagically, so the hidden variable would need to be populated via javascript or something in order to avoid the bots just processing an extra field and still spamming.

"When I started having a problem with spam on my blog I did that for awhile but they would catch on before too long and I had to keep changing the hidden word. That might not be an issue on blather, though."

I think what's happening is that someone has a bunch of web crawlers looking for things like "guestbook" etc, and as I said, parsing the form for likely fields and autosubmitting. I'm not convinced that a human being has ever actually looked at blather and thought "hey, let's spam that". So no, I don't think they'd adapt very quickly.
061024
...
The Doar Is Smiling (in the 3rd person) user24, I'm profoundly humbled by your initiative. to get one of the blather_creators (akin to a god around these parts, good on ya blather_god) to respond so quickly is.......well....dammit....i'm speechless!

.
061024
...
u24

















it felt so weird typing 'dallas' in the you box...
061024
...
u24 thanks, doar, that means a lot. 061024
...
dallas It looks like it's about 1.4GB of html files right now and it uses
somewhere between 1-2GB of bandwidth a day. It's not that much
overall, actually.

A simple cookie isn't perfect but it'll probably work pretty well for
this.

I'll probably just give this hidden field method a shot and see what happens. It's really easy to implement and might be effective enough for a while.

I'm going to dig into it a little before I watch a bit of Smallville. 8-)

Ok, I deleted quite a lot of the spam but as you mentioned it's a bit tedious. I think I probably got the worst of it. I also added the very simple hidden form variable and that will hopefully foil the bots for a little while. Let me know if you see more spam come in.
061025
...
() ( thank you. ) 061025
...
birdmad Amen! 061025
...
u24 thanks_dallas, that's just brilliant. seriously, muchly appreciated indeed.

We'll let you know if we find any more*, or if the spam starts again.

* actually, you missed some: how_big_is_blather, miss, site and top, though most of those happened after you'd removed the rest (I pray that you removed the existing spam and then added the hidden fields, otherwise they've adapted already...)

---

they've already spammed "websites" now. :-(
filling the field with javascript should work..

---
yeah, unfortunately the time limit thing isn't stopping them either; 'unknown' is getting spammed as we speak.

the thing that really pisses me off is the fact that this spam is totally ineffective as urls don't get turned into links! eugh. not only annoying but useless!

anyway, yeah, javascript? I know it would block non-JS browsers, but I don't think anyone is using one. though I could be wrong, and maybe it's not a risk that should be taken.
061025
...
dallas The bots usually just do an http POST request directly to the spam add command so javascript in the addform won't make a difference. I'll take a look at what they're doing now.

Thanks for letting me know.
061025
...
u24 won't it? - if the add script won't allow content to be added unless field X has value Y, and the add form adds value "Y" to field "X" via JS when the user clicks the start button, then the spambot won't know (unless it understands JS), which value to put in field X. Am I missing something?


"Thanks for letting me know."
-thanks for helping us out here!
061025
...
u24 start button=submit button.
mind is wandering.
061025
...
dallas I was under the assumption that the bots are skipping the addform altogether. You don't need to access the form itself to send data to the blather script.

I looked at the ones today and both of them appear to be scripting their web browsers or spamming manually, because the logs show a normal set of user requests. That sort of spamming would bypass any of the methods we've talked about so far other than captcha. I added some IP blocks since all the new ones I see from today have only come from a few different IPs. I'll work on some sort of captcha system now...
061025
...
u24 "I was under the assumption that the bots are skipping the addform altogether. You don't need to access the form itself to send data to the blather script."

yeah sure, but if the script requires a field that is only populated via JS on the previous form, then the bot can directly request the script all it wants. But that doesn't matter given that:

"I looked at the ones today and both of them appear to be scripting their web browsers or spamming manually because the logs show a normal user set of requests."

scripted web browsers... interesting approach.

"That sort of spamming would bypass any of the methods we've talked about so far other than captcha."

yeah, which is a shame.

"...all the new ones I see from today have only come from a few different IPs"

fair enough, my experience with my site was different but hopefully this'll keep some of them out. (I had to resort to captcha)

"I'll work on some sort of captcha system now..."

ok, great; I'll stop bugging you for a while :-)
061025
...
dallas The script on the server doesn't 'know' anything about javascript so
it can't know what value was filled in via javascript. The method I
used (that didn't work) fills in the hidden variable via the server
(the form itself is dynamically generated already) and then the
server would know to look for a value within a set range. That's why
the 60 minute time limit came into play. So this should prevent bots
from posting directly to the script without first accessing the form,
but they seem to be accessing the form and posting through it like a
normal user, now.
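
The transcript doesn't include the actual code, but the trick described here could be as simple as the server stamping the current time into the hidden field and the add command checking its age (the field name is invented):

use strict;
use warnings;
use CGI qw(:standard);

# when generating the addform: stamp it with the current epoch time
print hidden(-name => 'stamp', -value => time(), -override => 1);

# in the add command: accept only stamps less than 60 minutes old
my $stamp = param('stamp') || 0;
my $fresh = $stamp =~ /^\d+$/ && time() - $stamp < 60 * 60;
# reject the post unless $fresh is true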
061025
...
u24 sorry, you misunderstand (or I explain badly), I know what you did and why and how it worked, but with the original hidden field setup, a script could access the form, extract all fields from the html, leave hidden fields alone, guess that a textarea will be the main text so fill spam in there, fill a random 'email' address in and put random spam in all other fields, then POST to the form processor. Using javascript to populate the hidden fields would mean that that approach wouldn't work, because the spambot wouldn't be able to work out what should go in the hidden fields. The server doesn't need to know that the fields were filled in via JS, just that they were filled in correctly, which a spambot wouldn't be able to do.

eg:

form.html:
[input type="hidden" name="antispam" value="bad_value_here"]
(other fields etc)
[input type="submit" onClick="this.form.antispam.value='special_value_here';"]

form processor:
if (antispam == "special_value_here")
{ post comment } else { kick user out }

I hope this clarifies what I was suggesting

it's redundant if they are scripting a JS-aware browser, or if they're going through the process manually (IMHO unlikely given the volume), or if they're going through manually once, recording the traffic, and sending that traffic to a custom script to replace fields X, Y and Z with spam.
061025
...
u24 I think it's unlikely that what's happening is any of the "it's redundant if" scenarios mentioned above; the method I suggest of parsing form fields would be conducive to a one-size-fits-all spambot, like a modified webcrawler. I think that's what they're using - why would anyone manually script a browser/record traffic etc - to do anything manually is to realise that blather is not a good spam candidate. I may be wrong, they may be totally dumb. 061025
...
dallas Ah, got it. Well, I'm pretty close to having a captcha system
working now. I'll probably set it to require human verification
every 24 hours so people would have to do it once a day.
061025
...
u24 cool, that seems like the only ultimate way of stopping spam.
could I request that the captcha image be themed with the blather colours (eg blue text on dark blue background), if it's not too hard? (it would just be nicer on the eye) :-)
thanks again for all your work on this.
061025
...
dallas ok, captcha in place. It's using an 'off the shelf' perl module which includes it's own images for backgrounds. I don't think it clashes too much. 061025
...
u24 no, it looks great, and we'll (hopefully) only be seeing it once a day.

how is the session expiration managed? is the client-side cookie set to expire within 24 hours or is it managed server-side?
061025
...
nom neat 061025
...
Dallas it's a client-side cookie with an expiration time. I don't know how easy it might be to spoof cookies but I suspect most spammers wouldn't bother. 061025
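
(In CGI.pm terms that's nearly a one-liner; the cookie name here is invented:)

use CGI qw(:standard);

# after a passed captcha: the browser discards this cookie after a day,
# so nothing has to be tracked or expired on the server
my $c = cookie(-name => 'captcha_ok', -value => 1, -expires => '+24h');
print header(-cookie => $c);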
...
stork daddy it's all magic to me. lovely to experience vicariously through you lovely people though. and thanks for once again doing so right by us. 061025
...
Ubiquitous Flattery Hooray! Thanks to Dallas for protecting this Fort's Worth! Ha ha ha! 061025
...
u24 it's very easy, but I think you're right they wouldn't bother. 061026
...
u24 erm, /red has an error 500 on the add form, not sure if this is related. 061026
...
u24 it only happens when you haven't previously entered the captcha on blue 061026
...
dallas ok, red should work now. I forgot about that... 061026
...
u24 what about /green, is that fixed as well? 061026
...
u24 just got a few odd errors while trying to post. I had already passed the captcha:

Service Temporarily Unavailable
The server is temporarily unable to service your request due to maintenance downtime or capacity problems. Please try again later.

I've never seen that before.
061026