How to Block Bots in Apache Htaccess

Recently I had an application become the victim of bot spam. Since the web is something on the order of 60% bot traffic, many of these bots are inconsequential and can safely be blocked. I chose to block them based on user agent, since many of these bots have a range of IP addresses they can use.
Here is a list of the bots I was able to block from several applications without impacting SEO. Not all of these bots will be right to block for every application. The top entry, “^$”, is the regex for an empty string. I do not allow bots to access pages unless they identify with a user agent; I found that most often the only things hitting these applications without a user agent were security tools gone rogue.
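As a minimal sketch, the empty-user-agent rule by itself looks like this in mod_rewrite (assuming RewriteEngine is already on, as it is in the boilerplate further down):

```apache
# "^$" matches only an empty User-Agent header; such requests get a 403 Forbidden
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteRule .* - [R=403,L]
```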
^$
EasouSpider
Add Catalog
PaperLiBot
Spiceworks
ZumBot
RU_Bot
Wget
Java/1.7.0_25
Slurp
FunWebProducts
80legs
Aboundex
AcoiRobot
Acoon Robot
AhrefsBot
aihit
AlkalineBOT
AnzwersCrawl
Arachnoidea
ArchitextSpider
archive
Autonomy Spider
Baiduspider
BecomeBot
benderthewebrobot
BlackWidow
Bork-edition
Bot mailto:craftbot@yahoo.com
botje
catchbot
changedetection
Charlotte
ChinaClaw
commoncrawl
ConveraCrawler
Covario
crawler
curl
Custo
data mining development project
DigExt
DISCo
discobot
discoveryengine
DOC
DoCoMo
DotBot
Download Demon
Download Ninja
eCatch
EirGrabber
EmailSiphon
EmailWolf
eurobot
Exabot
Express WebPictures
ExtractorPro
EyeNetIE
Ezooms
Fetch
Fetch API
filterdb
findfiles
findlinks
FlashGet
flightdeckreports
FollowSite Bot
Gaisbot
genieBot
GetRight
GetWeb!
gigablast
Gigabot
Go-Ahead-Got-It
Go!Zilla
GrabNet
Grafula
GT::WWW
hailoo
heritrix
HMView
houxou
HTTP::Lite
HTTrack
ia_archiver
IBM EVV
id-search
IDBot
Image Stripper
Image Sucker
Indy Library
InterGET
Internet Ninja
internetmemory
ISC Systems iRc Search 2.1
JetCar
JOC Web Spider
k2spider
larbin
larbin
LeechFTP
libghttp
libwww
libwww-perl
linko
LinkWalker
lwp-trivial
Mass Downloader
metadatalabs
MFC_Tear_Sample
Microsoft URL Control
MIDown tool
Missigua
Missigua Locator
Mister PiX
MJ12bot
MOREnet
MSIECrawler
msnbot
naver
Navroad
NearSite
Net Vampire
NetAnts
NetSpider
NetZIP
NextGenSearchBot
NPBot
Nutch
Octopus
Offline Explorer
Offline Navigator
omni-explorer
PageGrabber
panscient
panscient.com
Papa Foto
pavuk
pcBrowser
PECL::HTTP
PHP/
PHPCrawl
picsearch
pipl
pmoz
PredictYourBabySearchToolbar
RealDownload
Referrer Karma
ReGet
reverseget
rogerbot
ScoutJet
SearchBot
seexie
seoprofiler
Servage Robot
SeznamBot
shopwiki
sindice
sistrix
SiteSnagger
SiteSnagger
smart.apnoti.com
SmartDownload
Snoopy
Sosospider
spbot
suggybot
SuperBot
SuperHTTP
SuperPagesUrlVerifyBot
Surfbot
SurveyBot
SurveyBot
swebot
Synapse
Tagoobot
tAkeOut
Teleport
Teleport Pro
TeleportPro
TweetmemeBot
TwengaBot
twiceler
UbiCrawler
uptimerobot
URI::Fetch
urllib
User-Agent
VoidEYE
VoilaBot
WBSearchBot
Web Image Collector
Web Sucker
WebAuto
WebCopier
WebCopier
WebFetch
WebGo IS
WebLeacher
WebReaper
WebSauger
Website eXtractor
Website Quester
WebStripper
WebStripper
WebWhacker
WebZIP
WebZIP
Wells Search II
WEP Search
Widow
winHTTP
WWWOFFLE
Xaldon WebSpider
Xenu
yacybot
yandex
YandexBot
YandexImages
yBot
YesupBot
YodaoBot
yolinkBot
youdao
Zao
Zealbot
Zeus
ZyBORG
Most often you find people blocking bots using something like this. I found this approach just keeps growing the .htaccess file, adding a lot of unneeded lines. Why do with two hundred lines what can be accomplished in two?
RewriteCond %{HTTP_USER_AGENT} ^Snoopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VB\ Project [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW::Mechanize [NC,OR]
RewriteCond %{HTTP_USER_AGENT} RPT-HTTPClient [NC]
RewriteRule .* - [R=403,L]
I found that I could achieve the same effect with two lines, and adding more entries became easier since they are simply separated by the pipe character, which signifies “or” in the regex. It seemed cleaner to me to have 2 lines doing the work of what was 232 lines.
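A trimmed sketch of that condensed form, with only a few example agents from the list standing in for the full alternation:

```apache
# one alternation-based condition replaces dozens of per-bot RewriteCond lines
RewriteCond %{HTTP_USER_AGENT} (Wget|curl|HTTrack|EmailSiphon) [NC]
RewriteRule .* - [R=403,L]
```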
You can see this in my .htaccess boilerplate below.
RewriteEngine on
RewriteBase /
Options -Indexes
 
# disallow access to special directories and feed back a 404 error
RedirectMatch 404 /\.svn(/|$)
RedirectMatch 404 /\.git(/|$)
 
# set headers that will override server defaults.
#Header set X-UA-Compatible "IE=9"
 
# ANYWHERE IN UA -- Block Bad Bots Greedy Regex
# Blocks a list of bots that were unnecessary for the applications I have created.
# note any of these bots can be removed from the listing
# make sure you know why you are blocking a bot or allowing it access.
# primary search engines are not blocked by this entry.
RewriteCond %{HTTP_USER_AGENT} ^.*(^$|CareerBot|CAUpdate|SiteExplorer|FreeBSD|ContextAd|YisouSpider|YahooCacheSystem|Synthomatic|Webmin|SEOENGWorldBot|ADmantX|linkdexbot|MojeekBot|niki\-bot|adidxbot|Scanmine|ia\_archiver|SuperPagesUrlVerifyBot|Apache\-HttpClient|PiplBot|ImageFetcher|pmoz|psbot|CATExplorador|Wget|EasouSpider|Add\ Catalog|PaperLiBot|Spiceworks|ZumBot|Java\/1\.7\.0\_45|Java\/1\.7\.0\_21|SemrushBot|Vigil|proximic|OpenfosBot|bitlybot|musobot|URLAppendBot|AboutUsBot|meanpathbot|Slurp|IstellaBot|GrapeshotCrawler|YandexImages|GarlikCrawler|A6\-Indexer|80legs|Aboundex|AcoiRobot|Acoon\ Robot|AhrefsBot|aihit|AlkalineBOT|AnzwersCrawl|Arachnoidea|ArchitextSpider|archive|Autonomy\ Spider|Baiduspider|BecomeBot|benderthewebrobot|BlackWidow|Bork\-edition|Bot\ mailto\:craftbot@yahoo\.com|botje|catchbot|changedetection|Charlotte|ChinaClaw|commoncrawl|ConveraCrawler|Covario|crawler|curl|Custo|data\ mining\ development\ project|DigExt|DISCo|discobot|discoveryengine|DOC|DoCoMo|DotBot|Download\ Demon|Download\ Ninja|eCatch|EirGrabber|EmailSiphon|EmailWolf|eurobot|Exabot|Express\ WebPictures|ExtractorPro|EyeNetIE|Ezooms|Fetch|Fetch\ API|filterdb|findfiles|findlinks|FlashGet|flightdeckreports|FollowSite\ Bot|Gaisbot|genieBot|GetRight|GetWeb\!|gigablast|Gigabot|Go\-Ahead\-Got\-It|Go\!Zilla|GrabNet|Grafula|GT\:\:WWW|hailoo|heritrix|HMView|houxou|HTTP\:\:Lite|HTTrack|ia\_archiver|IBM\ EVV|id\-search|IDBot|Image\ Stripper|Image\ Sucker|Indy\ Library|InterGET|Internet\ Ninja|internetmemory|ISC\ Systems\ iRc\ Search\ 2\.1|JetCar|JOC\ Web\ Spider|k2spider|larbin|larbin|LeechFTP|libghttp|libwww|libwww\-perl|linko|LinkWalker|lwp\-trivial|Mass\ Downloader|metadatalabs|MFC\_Tear\_Sample|Microsoft\ URL\ Control|MIDown\ tool|Missigua|Missigua\ Locator|Mister\ PiX|MJ12bot|MOREnet|MSIECrawler|msnbot|naver|Navroad|NearSite|Net\ Vampire|NetAnts|NetSpider|NetZIP|NextGenSearchBot|NPBot|Nutch|Octopus|Offline\ Explorer|Offline\ Navigator|omni\-explorer|PageGrabber|panscient|panscient\.com|Papa\ Foto|pavuk|pcBrowser|PECL\:\:HTTP|PHP/|PHPCrawl|picsearch|pipl|pmoz|PredictYourBabySearchToolbar|RealDownload|Referrer\ Karma|ReGet|reverseget|rogerbot|ScoutJet|SearchBot|seexie|seoprofiler|Servage\ Robot|SeznamBot|shopwiki|sindice|sistrix|SiteSnagger|SiteSnagger|smart\.apnoti\.com|SmartDownload|Snoopy|Sosospider|spbot|suggybot|SuperBot|SuperHTTP|SuperPagesUrlVerifyBot|Surfbot|SurveyBot|SurveyBot|swebot|Synapse|Tagoobot|tAkeOut|Teleport|Teleport\ Pro|TeleportPro|TweetmemeBot|TwengaBot|twiceler|UbiCrawler|uptimerobot|URI\:\:Fetch|urllib|User\-Agent|VoidEYE|VoilaBot|WBSearchBot|Web\ Image\ Collector|Web\ Sucker|WebAuto|WebCopier|WebCopier|WebFetch|WebGo\ IS|WebLeacher|WebReaper|WebSauger|Website\ eXtractor|Website\ Quester|WebStripper|WebStripper|WebWhacker|WebZIP|WebZIP|Wells\ Search\ II|WEP\ Search|Widow|winHTTP|WWWOFFLE|Xaldon\ WebSpider|Xenu|yacybot|yandex|YandexBot|yBot|YesupBot|YodaoBot|yolinkBot|youdao|Zao|Zealbot|Zeus|ZyBORG).*$ [NC]
RewriteRule .* - [R=403,L]
 
 
# SQL Injection Protection -- read more: www.cybercrime.gov
# Block MySQL injections, RFI, base64, etc.
# a list of regex request blocking that will help block SQL injection.
# note: This is an added layer of defense, but these are not substitutes for good code practices in general.
# Please make sure you take all of the precautions in your code as well.
 
RewriteCond %{QUERY_STRING} [a-zA-Z0-9_]=http:// [OR]
RewriteCond %{QUERY_STRING} [a-zA-Z0-9_]=(\.\.//?)+ [OR]
RewriteCond %{QUERY_STRING} [a-zA-Z0-9_]=/([a-z0-9_.]//?)+ [NC,OR]
RewriteCond %{QUERY_STRING} \=PHP[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12} [NC,OR]
RewriteCond %{QUERY_STRING} (\.\./|\.\.) [OR]
RewriteCond %{QUERY_STRING} ftp\: [NC,OR]
RewriteCond %{QUERY_STRING} http\: [NC,OR]
RewriteCond %{QUERY_STRING} https\: [NC,OR]
RewriteCond %{QUERY_STRING} \=\|w\| [NC,OR]
RewriteCond %{QUERY_STRING} ^(.*)/self/(.*)$ [NC,OR]
RewriteCond %{QUERY_STRING} ^(.*)cPath=http://(.*)$ [NC,OR]
RewriteCond %{QUERY_STRING} (\<|%3C).*script.*(\>|%3E) [NC,OR]
RewriteCond %{QUERY_STRING} (<|%3C)([^s]*s)+cript.*(>|%3E) [NC,OR]
RewriteCond %{QUERY_STRING} (\<|%3C).*iframe.*(\>|%3E) [NC,OR]
RewriteCond %{QUERY_STRING} (<|%3C)([^i]*i)+frame.*(>|%3E) [NC,OR]
RewriteCond %{QUERY_STRING} base64_encode.*\(.*\) [NC,OR]
RewriteCond %{QUERY_STRING} base64_(en|de)code[^(]*\([^)]*\) [NC,OR]
RewriteCond %{QUERY_STRING} GLOBALS(=|\[|\%[0-9A-Z]{0,2}) [OR]
RewriteCond %{QUERY_STRING} _REQUEST(=|\[|\%[0-9A-Z]{0,2}) [OR]
RewriteCond %{QUERY_STRING} ^.*(\[|\]|\(|\)|<|>).* [NC,OR]
RewriteCond %{QUERY_STRING} (NULL|OUTFILE|LOAD_FILE) [OR]
RewriteCond %{QUERY_STRING} (\./|\../|\.../)+(motd|etc|bin) [NC,OR]
RewriteCond %{QUERY_STRING} (localhost|loopback|127\.0\.0\.1) [NC,OR]
RewriteCond %{QUERY_STRING} (<|>|'|%0A|%0D|%27|%3C|%3E|%00) [NC,OR]
RewriteCond %{QUERY_STRING} concat[^\(]*\( [NC,OR]
RewriteCond %{QUERY_STRING} union([^s]*s)+elect [NC,OR]
RewriteCond %{QUERY_STRING} union([^a]*a)+ll([^s]*s)+elect [NC,OR]
RewriteCond %{QUERY_STRING} (;|<|>|'|"|\)|%0A|%0D|%22|%27|%3C|%3E|%00).*(/\*|union|select|insert|drop|delete|update|cast|create|char|convert|alter|declare|order|script|set|md5|benchmark|encode) [NC,OR]
RewriteCond %{QUERY_STRING} (sp_executesql) [NC]
RewriteRule ^(.*)$ - [F,L]
 
# the below 2 lines can be used to block specific addresses.
# use this with caution
#RewriteCond %{REMOTE_ADDR} ^255\.255\.255\.255$
#RewriteRule .* - [F]
 
 
# if an image is requested, deliver it as-is and stop rule processing
RewriteCond %{REQUEST_URI} \.(bmp|gif|jpe?g|png|ico)$
RewriteRule ^(.*)$ - [NC,L]
 
# if a CSS or JS file is requested, deliver it as-is and stop rule processing
RewriteCond %{REQUEST_URI} \.(css|js)$
RewriteRule ^(.*)$ - [NC,L]
 
# if a txt, doc, pdf, xls, or xml file is requested, deliver it as-is and stop rule processing
RewriteCond %{REQUEST_URI} \.(txt|doc|pdf|xls|xml)$
RewriteRule ^(.*)$ - [NC,L]
 
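As a sanity check outside Apache, this Python sketch mirrors the `union([^s]*s)+elect` condition from the query-string rules above; the sample query strings are made up for illustration:

```python
import re

# Mirrors the RewriteCond pattern union([^s]*s)+elect with the [NC] flag
# (case-insensitive). It catches "union select" even when filler characters
# are inserted between the two keywords.
pattern = re.compile(r"union([^s]*s)+elect", re.IGNORECASE)

print(bool(pattern.search("id=1+UNION+SELECT+password+FROM+users")))  # matched, would be blocked
print(bool(pattern.search("page=union-station")))                     # not matched, allowed through
```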
When blocking bots, I advise you to be very specific. Simply using a generic word like “fire” could pop positive for “Firefox”. You can adjust the regex to fix that issue, but I found it much simpler to be more specific, which has the added benefit of being more informative to the next person to touch the .htaccess.
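To illustrate the false-positive risk, here is a small Python sketch (the bot name `FireBot` is hypothetical) comparing an overly generic pattern with a more specific one:

```python
import re

# An overly generic pattern: "fire" also matches a legitimate Firefox UA.
generic = re.compile(r"fire", re.IGNORECASE)
# A specific pattern naming the (hypothetical) bot avoids the false positive.
specific = re.compile(r"\bfirebot\b", re.IGNORECASE)

firefox_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
bot_ua = "FireBot/1.0 (+http://example.com/bot)"

print(bool(generic.search(firefox_ua)))   # the generic pattern wrongly flags Firefox
print(bool(specific.search(firefox_ua)))  # the specific pattern lets Firefox through
print(bool(specific.search(bot_ua)))      # the specific pattern still flags the bot
```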
Additionally, you will see I have a rule for Java/1.7.0_25; in this case it happened to be a bot using this version of Java to slam my servers. Do be careful blocking language-specific user agents like this: some languages, such as ColdFusion, run on the JVM and use the language user agent and web requests to localhost to assemble things like PDFs. JRuby, Groovy, or Scala may do similar things, though I have not tested them.
Thanks for reading.
