All your forms are belong to bots, part deux, the sequel. Again.
by schwim on Apr.01, 2009, under Spam
Who’d have thunk it that a blog post concerning spam bots would be of interest in this day and age? After watching the search results pile up for a blog post that was intended to be a light touch upon a subject, I started feeling bad. Everyone kept showing up here looking for a magic bullet and I gave them a slingshot. Nay, a plastic spork.
Well, this time, the spork is made of wood!
What? You were hoping for a silver bullet the second time around? Whoops.
Let me just say this: As long as there are people with no morals and a lot of greed, you will have spam. The upside to this is that you can have a lot less spam than your neighbor.
How far you have to go depends on the popularity of your site and the type of form that you’re trying to protect.
Anything that gets moderated prior to publication is less critical than something that gets published without any kind of administrative approval. Forum posts pose much more of a threat than site contact forms. With the site contact form, you’re dealing with a nuisance. With a forum, you’re putting your site ranking at risk.
I can tell you that I have successfully protected both ends of the spectrum from spam. In both cases it took work, but the free time you lose during the coding process is made up when you don’t have to delete hundreds (or thousands) of posts and bogus users.
About CAPTCHAs: Before I get into what I use, I want to tell you what I don’t use. I don’t use prepackaged CAPTCHAs. There are two reasons for this: Firstly, for every system that gets used in the mainstream, there are hundreds of spammers cracking the code for use. 99% of the prepackaged systems don’t remain viable for long. Secondly, why in the hell would you task your customer/visitor with carrying the burden of protecting your site from spam? reCAPTCHA has become incredibly popular with the web developer who doesn’t like spam but is too lazy to do something about it. I visited one site that used it and after typing two words in, I had to copy and paste about 200 characters of text from one box to another. This was after successfully completing the challenge words.
Are you fucking kidding me?
Add to this the ever-increasing obfuscation of the challenge text and you’ve got a real winner of a system on your hands. About 25% of the time, I get them wrong and I’m a human. For those in the dark, when humans can’t pass a test that tells humans and computers apart, the test needs work. These are the reasons I’ve grown so hostile towards the idea of prepackaged CAPTCHAs. They’ve outsmarted themselves. They no longer tell humans and computers apart, they just separate the persistent from the apathetic.
If you’re willing to do some work to protect your forms, then here are some ideas that have worked very well for me, in addition to what was already mentioned in the previous post:
Home-grown CAPTCHA: I run a VW enthusiast site that was getting hundreds of bot registrations a week. I created the following CAPTCHA to use during registration:
Please de-scramble the following letters to reveal the coolest ride in the world:
WV
with a two letter length text box below.
If they get it right, it submits the user info and continues on the registration process. If they fail it, it loops the registration process back to the beginning telling them to take their time on the hard questions.
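A minimal sketch of that kind of check in PHP might look like the following. The field name descramble and both function names are made up for illustration; this is not the code running on the site, just one way the pass/fail loop could be wired up.

```php
<?php
// Hypothetical helper: true when the visitor unscrambled "WV" correctly.
function captcha_passed(string $answer, string $expected = 'VW'): bool
{
    // Trim stray whitespace and ignore case so "vw " still counts.
    return strtoupper(trim($answer)) === $expected;
}

// Hypothetical routing: decide the next step of the registration flow.
// $post stands in for $_POST so the function can be tried in isolation.
function next_registration_step(array $post): string
{
    $answer = (isset($post['descramble']) && is_string($post['descramble']))
        ? $post['descramble']
        : '';

    // 'continue' carries on with registration, 'restart' loops back to the start.
    return captcha_passed($answer) ? 'continue' : 'restart';
}
```

In a real page you would call next_registration_step($_POST) and redirect based on the result.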
That alone put a complete halt to bogus registrations. It’s been over a year and I’ve not had a false registration yet. Why? Simple. Although it’s the essence of simplicity, it’s not been seen by a bot yet and it doesn’t know what to do with it. This form does what pre-fuzzed, randomly generated text can’t do.
I don’t mind using a challenge/pass type system for single time things like registrations, but I won’t use something even so simplistic as that for something that people may use more than once. As I’ve already stated, it’s not their job to protect the site. So what do I use for other forms?
The one I spent the most time on is a testimonial script that I released to the public. As the script became more popular, more people began complaining that automated bots were posting up to hundreds of testimonials in a matter of minutes. The problem was that the testimonial block that got included on every one of their site pages had a link that invited visitors to submit a testimonial. That link was getting hit by the bots over and over again. It didn’t pose any risk to the site because the testimonials were invisible until approved, but having to delete hundreds of bogus posts a day gets old quick. So I set to work on putting a dent in the problem for the next release.
First, how do bots work? In my estimation, almost always one of two ways:
1) The bot visits your form and submits the information.
2) A person visits your form and tailors the bot to submit the information to the processing portion of the script, bypassing the form altogether.
Here are the things I did to combat the bogus submissions. So far, I have had absolutely no reports of spam and the system has been in use for over a year.
To combat the bots that visit the form and submit it, I put a very simple timer on the form, which is completely customizable. I figured it should take at least 60 seconds to fill out the form. Any quicker than that and it’s bogus, whether it’s from a bot or not. Nobody could fill out the form I had with meaningful information in less than 60 seconds. Any more than 30 minutes is bogus as well. So on the form side, I used mktime() to come up with a timestamp, set it as a session var and then checked it on the processing end. If fewer than 60 seconds had passed, then no soup for the submitter. They did get a pretty error, but I didn’t save their data. Any multi-element form filled out in under 60 seconds didn’t have a lot of content to begin with. I did the same when 30 minutes or more had passed. Bots fill out forms incredibly quickly. It doesn’t take long for them to submit and they have a lot of sites to attack, so they’re in a hurry. I could have set the minimum much lower and still been protected, but I leave that up to the script user.
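The timer described above could be sketched roughly like this. The session key form_started, the constant names and the function names are assumptions for illustration, and the session is passed in as a plain array so the logic can be tried without a web server. time() is used in place of the argument-less mktime() call, which newer PHP versions no longer accept.

```php
<?php
// Assumed limits, matching the numbers in the post; adjust to taste.
const MIN_SECONDS = 60;
const MAX_SECONDS = 1800; // 30 minutes

// Form side: remember when the form was generated.
function stamp_form(array &$session, int $now): void
{
    $session['form_started'] = $now;
}

// Processing side: accept only submissions that took a plausible amount of time.
function submitted_in_time(array $session, int $now): bool
{
    if (!isset($session['form_started'])) {
        return false; // the form was never loaded, e.g. a direct POST
    }
    $elapsed = $now - (int) $session['form_started'];

    // Under 60 seconds is too fast; 30 minutes or more is too slow.
    return $elapsed >= MIN_SECONDS && $elapsed < MAX_SECONDS;
}
```

On a live site, after session_start(), the form page would call stamp_form($_SESSION, time()) and the processing page would call submitted_in_time($_SESSION, time()) before saving anything.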
The next thing I did was to combat the bot that saved their own copy of the form or the bot that never visited it at all, instead posting data directly to the processing portion of the script.
I created a four-digit random number on each form generation that became a sort of ID: $seed = rand(1000,9999); I then took that ID and prepended it to all form element names:
name='name' became name='".$seed."name', so if you looked at the source of the generated page, it might look like name='9373name'. The seed then got saved as a session variable, which we retrieved on the processing side. If a bot submitted info repeatedly from a saved copy of the form, then there was no session variable to match its form data to, and the form element names it submitted didn’t exist. The processing script didn’t know what to do with them. It’s very simple. If the session seed value either doesn’t exist or doesn’t match the form elements, then it’s not a real user.
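Here is a rough sketch of the seeded-name idea, under a few assumptions: the session key seed and all three function names are hypothetical, and the session and POST data are passed in as plain arrays so the logic can be exercised on its own.

```php
<?php
// Form side: pick a fresh four-digit seed and store it in the session.
function new_seed(array &$session): int
{
    $seed = rand(1000, 9999);
    $session['seed'] = $seed;
    return $seed;
}

// Form side: build the prefixed name attribute, e.g. "9373name".
function field_name(int $seed, string $field): string
{
    return $seed . $field;
}

// Processing side: return the un-prefixed values, or null for a bogus submission.
function read_seeded_fields(array $session, array $post, array $fields): ?array
{
    if (!isset($session['seed'])) {
        return null; // no seed in the session: the form was never generated
    }
    $values = [];
    foreach ($fields as $field) {
        $key = field_name((int) $session['seed'], $field);
        if (!isset($post[$key])) {
            return null; // submitted names do not match this form generation
        }
        $values[$field] = $post[$key];
    }
    return $values;
}
```

When printing the form, each input would get name="<?php echo field_name($seed, 'name'); ?>". One option, not described in the post, is to unset the session seed after a successful submission so a saved copy of the same form cannot be replayed.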
As I said, so far I haven’t had any reports of bots getting past this system. If they do, my next step will be to use Akismet. It works incredibly well for blogs, it’s free to use and it still doesn’t burden the end user. I’ve used it successfully on forums and blogs.
Other than those particular ideas, keep in mind that common sense rules the day. People get spam because spambots are able to submit spam on their site. Sound oversimplified? Bots are written to handle common site scripts. They have a database of the most common form element names, thanks to the popularity of vBulletin, phpBB and others. The majority of sites around the world have the exact same CAPTCHA, the same form element names and the same hurdles to overcome in order to post. If you’re like every other site, then you have no protection. Obscurity works, in this case. If your form element names are “asfdasdfewrg”, “tubesockpreference” or “superman”, then a bot doesn’t know what that portion of the form is looking for. In this case, they’ll often drop a link or spam text into it as a hail mary, but often, this will trigger a validity check and the bot will end up in an endless loop.
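As one example of the kind of validity check that can trip up a hail-mary submission, here is a small sketch that flags a plain-text field containing a link. The function name and the pattern are assumptions for illustration, not taken from any particular script, and the pattern is deliberately crude.

```php
<?php
// Hypothetical check: a field that should hold a short plain answer
// fails validation when it contains a URL, an HTML link or a BBCode link.
function looks_like_spam(string $value): bool
{
    // Matches "http://", "https://", "www.", "<a " and "[url", ignoring case.
    return preg_match('~https?://|www\.|<a\s|\[url~i', $value) === 1;
}
```

A field like “tubesockpreference” has no business containing a URL, so the processing script could send the submitter back to the form whenever looks_like_spam() returns true for it.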
Of course, you can always use reCAPTCHA and leave it up to us to protect you. Those of us who don’t leave will do our best.