SquidBlocker - About page

Thanks to

This is the place to thank my family and friends, who support me through each and every step of my long days. Without my family and my boss I might not have had the strength to build this piece of software.

Why?

It all started with SquidGuard and the need to block or allow websites. SquidGuard has done a great job for years, but it lacks a couple of things: it does not support concurrency, and every DB/blacklist update requires a squid restart or reload.

Because of this, every DB/blacklist update adds either complexity or downtime. Not supporting concurrency also means that a busy system needs a huge number of workers running in parallel so that requests are not slowed down, at the cost of memory and CPU. A GoLang helper can handle about 2k concurrent requests, and in a production SMB office I am using 5 helpers to handle traffic that, without concurrency support, would need about 40 SquidGuard helpers.

Who is SquidBlocker meant for?

SquidBlocker is there for sysadmins whose services need to stay up for very long periods of time without restarting or reloading the DB server. One example is a hospital, where human lives are at stake and shutting down the service might lead to unwanted situations.

SquidBlocker DB library

SquidBlocker uses LevelDB as its key/value DB and is therefore very fast. You can do anything in the LevelDB using other tools. With a central DB and multiple DB servers there is the option to rsync the whole DB after putting the server into DB-closed mode or shutting it down (possible if you are using some load-balancing reverse proxy).
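
For anyone who wants to poke at the DB directly from code, here is a minimal Go sketch using the goleveldb library; the DB path and the library binding are assumptions on my side, and the "dom:" key scheme follows the DB formats section below.

package main

import (
    "fmt"
    "log"

    "github.com/syndtr/goleveldb/leveldb"
)

func main() {
    // Open the on-disk LevelDB directory (the path here is an assumption).
    db, err := leveldb.OpenFile("/var/lib/squidblocker/db", nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Blacklist a domain using the "dom:" key scheme.
    if err := db.Put([]byte("dom:example.com"), []byte("1"), nil); err != nil {
        log.Fatal(err)
    }

    // Read the key back.
    val, err := db.Get([]byte("dom:example.com"), nil)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("dom:example.com = %s\n", val)
}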

Installation script

#!/usr/bin/env bash
# Add the NgTech repo that carries the squidblocker packages.
cat > /etc/yum.repos.d/squid.repo <<EOF
[squid]
name=Squid repo for CentOS Linux
#IL mirror
baseurl=http://www1.ngtech.co.il/repo/centos/\$releasever/\$basearch/
failovermethod=priority
enabled=1
gpgcheck=0
EOF
yum update -y
yum install httpd php squidblocker -y
# Allow .htaccess overrides and point the DocumentRoot at squidblocker's web root.
sed -i -e 's/AllowOverride\ None/AllowOverride All/g' /etc/httpd/conf/httpd.conf
sed -i -e 's/AllowOverride\ none/AllowOverride All/g' /etc/httpd/conf/httpd.conf
sed -i -e 's@DocumentRoot\ \"\/var\/www\/html\"@DocumentRoot "/var/www/squidblocker"@g' /etc/httpd/conf/httpd.conf
# Expose the block page and enable both the web server and the DB service.
ln -s /var/www/block_page /var/www/squidblocker/block_page
systemctl enable httpd
systemctl enable sbserver
systemctl start httpd
systemctl start sbserver

SquidBlocker hub/broadcaster

Since HA and LB are part of "high uptime", I wrote a small reverse proxy that mirrors updates and changes across multiple DB hosts.
You can configure the systemd script with a comma-separated string that declares the DB hosts. The server never returns any information about the success of the action, but any issue contacting a peer or completing the request is written to the stdout/systemd output.
The HUB can help send PUT or GET messages to multiple DB hosts. It can also work as a PURGE HUB, sending a PURGE request to multiple hosts when a key is updated.
File "/usr/sbin/sblocker_http_hub"
Systemd service: sbhub
config file: /etc/sysconfig/sbserver.service
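
The general fan-out idea can be sketched in a few lines of Go; the listen address and default peer below follow the defaults mentioned in the TODO list, and none of this is the actual hub code:

package main

import (
    "io"
    "log"
    "net/http"
    "strings"
)

// peers is the comma-separated DB host list, as it would come from the sysconfig file.
var peers = strings.Split("http://localhost:8080", ",")

// handler forwards every request to all peers and only logs failures;
// like the real hub, it never reports success or failure to the client.
func handler(w http.ResponseWriter, r *http.Request) {
    body, _ := io.ReadAll(r.Body)
    for _, peer := range peers {
        req, err := http.NewRequest(r.Method, peer+r.RequestURI, strings.NewReader(string(body)))
        if err != nil {
            log.Println("building request for", peer, "failed:", err)
            continue
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Println("contacting", peer, "failed:", err)
            continue
        }
        resp.Body.Close()
    }
    w.WriteHeader(http.StatusOK)
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe("localhost:8081", nil))
}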

SquidBlocker caching

Since SquidBlocker is an HTTP service, there is the option to put squid, varnish, or nginx in front of it as a reverse caching proxy. It doesn't support "IMS" validation and always returns a real-time response. The HUB can help send PURGE, PUT, or GET messages to multiple cache hosts.

SquidBlocker squid client

I am providing a squid external ACL helper that supports concurrency and works with the following settings (a rough helper skeleton follows the config):

# Concurrent external ACL helper that asks the SquidBlocker DB server about each request.
external_acl_type filter_url ipv4 concurrency=50 ttl=3 %URI %METHOD %un /usr/bin/sblocker_client -http=http://filterdb:8080/sb/01
acl filter_url_acl external filter_url
# Send blocked clients to the block page, carrying the original URL and domain.
deny_info http://<SOME SERVER NAME OR IP>/block_page/?url=%u&domain=%H filter_url_acl

acl localnet src 192.168.0.0/16

http_access deny !filter_url_acl
http_access allow localnet filter_url_acl
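
For illustration, here is a skeleton of a concurrent external ACL helper speaking squid's concurrency protocol: with concurrency=50 squid prefixes every line with a channel ID, and the helper echoes that ID back with OK or ERR. The DB check is stubbed out, so this is not the real sblocker_client:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

// check stands in for the HTTP lookup against the filter DB
// (e.g. http://filterdb:8080/sb/01/url/?url=<URI>).
func check(uri, method, user string) bool {
    return !strings.Contains(uri, "blocked.example.com")
}

func main() {
    in := bufio.NewScanner(os.Stdin)
    out := bufio.NewWriter(os.Stdout)
    for in.Scan() {
        // With %URI %METHOD %un each line looks like: "<ID> <URI> <METHOD> <username>"
        fields := strings.Fields(in.Text())
        if len(fields) < 3 {
            continue
        }
        id, uri, method := fields[0], fields[1], fields[2]
        user := "-"
        if len(fields) > 3 {
            user = fields[3]
        }
        answer := "ERR"
        if check(uri, method, user) {
            answer = "OK"
        }
        fmt.Fprintf(out, "%s %s\n", id, answer)
        out.Flush()
    }
}

A production helper would run each lookup in its own goroutine so one slow request does not block the other channels.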

SquidBlocker hammer/updater

SquidBlocker "Hammer" runs batch updates to the DB server. It was written due to the fact that it runs updates in 5% of the time of that with a single request per key update. It uses a "PUT" request to black or whitelist a list of domains urls or tcp ip:port list. An example of usage:

 * For domains blacklist
/usr/bin/sblocker_hammer -f="BL/porn/domains" -http="http://127.0.0.1:8080/db/put_batch/?val=1&prefix=dom:" -test="http://127.0.0.1:8080/sb/01/url/?url=" -t="http://block.test.example.com/?test1=1"

 * For urls blacklist
/usr/bin/sblocker_hammer -f="BL/porn/urls" -http="http://127.0.0.1:8080/db/put_batch/?val=1&prefix=url:http://" -test="http://127.0.0.1:8080/sb/01/url/?url=" -t="http://block.test.example.com/?test1=1"

 * For IP:port blacklist
/usr/bin/sblocker_hammer -f="BL/porn/ipport" -http="http://127.0.0.1:8080/db/put_batch/?val=1&prefix=url:tcp://" -test="http://127.0.0.1:8080/sb/01/url/?url=" -t="http://block.test.example.com/?test1=1"

A note about securing the DB service.

A little more about SquidGuard

SquidGuard is an ACL helper that uses BDB files, with the key used to block or allow a domain or a URL path (without query terms). It runs a series of lookups, in many cases against a bunch (10+) of BDB files.

A little note about BDB: it is an embedded database for key/value data, which means that for a programmer it is a library that lets you work with and write code against files in a specific format and API. The last time I tried the ruby and other language APIs, they were based on the C binding and didn't allow me to run write operations concurrently. From what I understood, it works with a lock mechanism which should allow some concurrency for reads.
Who is using it? Many!! (OpenLDAP, RPM, memcachedb.) The size limit of a DB file is from 2TB to 256TB, depending on the DB page size.

Realistic facts about blacklist categorizing

SquidGuard username ACLs handling

SquidGuard queries the DB and caches the credentials for a specific amount of time defined by the user.

About URL Testing algorithms

SquidGuard

SquidGuard's basic tests are done first on domains and then on the path. In the domain DB, if an upper-level domain is present it's a match: if the DB contains gambit.com then www.gambit.com, gambit.com and www1.gambit.com will all match, but testgambit.com will not. For the path tests, if there is a "www." in the URL host, SquidGuard first strips it and then runs a series of tests against the DB. The URL lookup tests the full path first and then walks backwards through the path until it reaches the root path and stops. Without the stripping, the path lookup would split into two cases, with or without a leading "www." (closest to the scheme) on the domain name; it also means that URLs in the DB that start with "www." are meaningless.

An example would be "http://www.example.com/test1/1.jpg?arg=1". SquidGuard will first convert it to "example.com/test1/1.jpg" and then test for a full match, "example.com/test1/1.jpg", and then walk backwards towards the root path to see if there is a blacklist match, resulting in the lookups "example.com/test1/" and later "example.com/". If the URL contained another host without "www.", such as "http://test.example.com/test1/1.jpg?arg=1", it would be tested for "test.example.com/test1/1.jpg" and backwards: "test.example.com/test1/", "test.example.com/". So: no port, no scheme, no "www." and no query terms in the URL DB. I do not know how SquidGuard handles regex lists, so they are out of the scope of this doc.
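
The domain side of this lookup can be sketched by walking from the full host up through its parent domains; the DB is reduced to a plain map here:

package main

import (
    "fmt"
    "strings"
)

// domainBlocked mimics the SquidGuard domain rule: a host matches when it,
// or any upper-level domain of it, is present in the domains list.
func domainBlocked(db map[string]bool, host string) bool {
    labels := strings.Split(host, ".")
    for i := range labels {
        if db[strings.Join(labels[i:], ".")] {
            return true
        }
    }
    return false
}

func main() {
    db := map[string]bool{"gambit.com": true}
    for _, h := range []string{"www.gambit.com", "gambit.com", "www1.gambit.com", "testgambit.com"} {
        fmt.Println(h, "->", domainBlocked(db, h)) // only testgambit.com prints false
    }
}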

Cisco uses a similar approach in their Meraki filtering products.

Path lookup alternative.

SquidGuard tries to be as efficient and fast as possible for blacklists and therefore leaves many things out of account. Since it does not use the scheme, the port, or the query, there are a couple of scenarios it cannot handle well. Some places require more than just a blacklist system and need a system that will match full URLs for a whitelist or a blacklist. An algorithm for these cases requires another lookup approach, which I have implemented. The algorithm is a "PATH reverse lookup only", which means it looks up the full URL (with query terms, full host and port) and then tests backwards towards the root path of the URL. For example, "http://www.example.com/test1/1.jpg?arg=1" will be tested (first match) for:

http://www.example.com/test1/1.jpg?arg=1
http://www.example.com/test1/1.jpg
http://www.example.com/test1/
http://www.example.com/

In case a port is present it will not be stripped. An example for a port-present case would be the URL "http://www.example.com:8080/test1/1.jpg?arg=1", which would be tested for:

http://www.example.com:8080/test1/1.jpg?arg=1
http://www.example.com:8080/test1/1.jpg
http://www.example.com:8080/test1/
http://www.example.com:8080/

The leading slash "/" is an issue by itself, since it is always used and required: the path lookup relies on there always being a root path "/" in the URI, as that is its structure. It also takes into account that a full-match test runs before the path is tested in reverse, which means that matching a full path requires the leading "/" of the path to be stripped. This affects the way we store URLs in the DB.
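
The reverse path walk itself can be sketched as a candidate generator; treating the "strip the query terms" step as its own lookup is my reading of the description above:

package main

import (
    "fmt"
    "strings"
)

// pathCandidates produces the first-match lookup order: the full URL
// (query included), the URL without the query, and then every parent
// path back to the root "/", keeping scheme, host and port intact.
func pathCandidates(u string) []string {
    out := []string{u}
    if i := strings.Index(u, "?"); i >= 0 {
        u = u[:i]
        out = append(out, u)
    }
    for {
        i := strings.LastIndex(u, "/")
        // Stop once the only slashes left are the "://" of the scheme.
        if i <= strings.Index(u, "//")+1 {
            break
        }
        u = u[:i]
        out = append(out, u+"/")
    }
    return out
}

func main() {
    for _, c := range pathCandidates("http://www.example.com:8080/test1/1.jpg?arg=1") {
        fmt.Println(c)
    }
}

Running it prints exactly the four candidates listed above for the port-present example.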

Meraki algorithm (from their docs)

Whenever a device on the network accesses a web page, the requested URL is checked against the configured lists to determine if the request will be allowed or blocked.

Pattern matching follows these steps in order:

  1. Try to match the full URL against either list (blocked vs whitelisted patterns list)
     e.g., http://www.foo.bar.com/qux/baz/lol?abc=123&true=false
  2. Remove the protocol and leading "www" from the URL, and check again:
     e.g., foo.bar.com/qux/baz/lol?abc=123&true=false
  3. Remove any "parameters" (everything following a question mark) and check again:
     e.g., foo.bar.com/qux/baz/lol
  4. Remove paths one by one, and check each:
     e.g., foo.bar.com/qux/baz, then foo.bar.com/qux, then foo.bar.com
  5. Cut off subdomains one by one and check again:
     e.g., bar.com, and then .com
  6. Finally, check for the special catch-all wildcard, *, in either list.

If any of the above steps produces a match, then the request will be blocked or whitelisted as appropriate. The whitelist always takes precedence over the blacklist, so a request that matches both lists will be allowed. If there is no match, the request is subject to the category filtering settings above.
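
For comparison, here is a sketch that generates the Meraki candidate list in the order of the quoted steps; it is a literal reading of their docs, not Meraki code:

package main

import (
    "fmt"
    "strings"
)

// merakiCandidates lists the patterns checked for a URL: the full URL,
// then protocol and leading "www" removed, then parameters removed,
// then paths removed one by one, then subdomains cut one by one,
// and finally the catch-all wildcard.
func merakiCandidates(rawURL string) []string {
    out := []string{rawURL}

    // Step 2: remove the protocol and a leading "www.".
    u := rawURL
    if i := strings.Index(u, "://"); i >= 0 {
        u = u[i+3:]
    }
    u = strings.TrimPrefix(u, "www.")
    out = append(out, u)

    // Step 3: remove everything following a question mark.
    if i := strings.Index(u, "?"); i >= 0 {
        u = u[:i]
        out = append(out, u)
    }

    // Step 4: remove paths one by one.
    for strings.Contains(u, "/") {
        u = u[:strings.LastIndex(u, "/")]
        out = append(out, u)
    }

    // Step 5: cut off subdomains one by one, ending with e.g. ".com".
    for strings.Count(u, ".") > 1 {
        u = u[strings.Index(u, ".")+1:]
        out = append(out, u)
    }
    if i := strings.LastIndex(u, "."); i > 0 {
        out = append(out, u[i:])
    }

    // Step 6: the catch-all wildcard.
    return append(out, "*")
}

func main() {
    for _, c := range merakiCandidates("http://www.foo.bar.com/qux/baz/lol?abc=123&true=false") {
        fmt.Println(c)
    }
}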

My algorithm of choice = Path + domain blacklist lookup alternative.

Regular URLs:

To allow more flexibility for stricter environments, there is a way to run the lookup for a full match first, then recursively test backwards, and when reaching the domain level, test the host against the domains blacklist. Such a lookup path for "http://www.example.com:8080/test1/1.jpg?arg=1" would look like this:

http://www.example.com:8080/test1/1.jpg?arg=1
http://www.example.com:8080/test1/1.jpg
http://www.example.com:8080/test1/
http://www.example.com:8080/

and then the host "www.example.com" (with its upper-level domains) against the domains blacklist.

It's a "slow" lookup but it is the most resilience algorithms.

For IP-based hosts there is only one lookup, in the domains blacklist.

tcp/CONNECT IP/DOMAIN + port:

The CONNECT method for tunneling connections uses a destination IP or domain plus a port, which is another side of the algorithm. SquidGuard only takes the domain/IP level of the issue into account and therefore doesn't fit many environments in which the proxy needs to be more resilient: either block everything and allow from a list, or allow everything and block very specific tcp services. With a domain/IP-only match there is no way to allow or block only some of a host's tcp services. This is basically more of a firewall-level issue, but a squid proxy needs to handle it. There are two checks:

the exact destination "host:port"
the catch-all wildcard "*"

V6 addresses(tcp and urls) handling

There are a couple of places where an IPv6 address can be present: the host part of a regular URL, and the IP:port destination of a CONNECT request.

The RFC states (somewhere...) that an IPv6 URL should be represented with square brackets around the IPv6 host, such as:

http://[2001:41c8:20::5002]:8080/

Squid receives this and sends it to the helper in an escaped format, such as:

http://%5B2001:41c8:20::5002%5D:8080/

I do not know if it is a bug or if this is how it supports helpers, but this is how it is now. So to handle it I am converting the escaped format (which is not what the URL standard instructs) into the real URL, such as:

http://[2001:41c8:20::5002]:8080/

and then testing it against the DB. This is because I cannot unescape the whole URL, since it might contain escaped values which need to be tested as they are. For a CONNECT request squid does the same thing and converts the square brackets into an escaped format, such as:

%5B2001:41c8:20::5002%5D:443

Again, I am converting the escaped brackets and represent the "IPV6:PORT" in the form "[IPV6]:port", and this is the way I test it. Every IPv6 address is converted and stripped of unneeded zeroes, using the double colon to compress a whole run of zero groups where the standard allows it.
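
A sketch of this conversion in Go; it assumes squid escapes only the square brackets, and it leans on net.SplitHostPort (also mentioned in the TODO list below) and net.ParseIP for the normalization:

package main

import (
    "fmt"
    "log"
    "net"
    "strings"
)

// normalizeConnectTarget turns squid's escaped form, e.g.
// "%5B2001:41c8:20:0:0:0:0:5002%5D:443", back into the standard
// "[IPV6]:port" form with the address compressed.
func normalizeConnectTarget(target string) (string, error) {
    t := strings.NewReplacer("%5B", "[", "%5b", "[", "%5D", "]", "%5d", "]").Replace(target)
    host, port, err := net.SplitHostPort(t)
    if err != nil {
        return "", err
    }
    if ip := net.ParseIP(host); ip != nil && ip.To4() == nil {
        // ip.String() strips unneeded zeroes and compresses a run of
        // zero groups with "::", as the standard allows.
        host = ip.String()
    }
    return net.JoinHostPort(host, port), nil
}

func main() {
    out, err := normalizeConnectTarget("%5B2001:41c8:20:0:0:0:0:5002%5D:443")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(out) // [2001:41c8:20::5002]:443
}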

DB formats

SquidGuard

SquidGuard uses two different lists with different characteristics in mind. To match SquidGuard's list logic I am using two key schemes: "dom:" for domains and "url:" for URLs.

Path

SquidBlocker interfaces

* Things to be done:
- I need to change it so that base64 decoding will be enabled using a "base64=X" query term instead of direct interfaces.

- Database structure:
- using the dom: scheme to tell domain keys apart from URLs and from the domain variable
- using the url: scheme to tell URL keys apart from the main DB and from the domain variable
- using user_weight: to store each user's allowed weight
- using group_weight: to store each group's allowed weight

- Tests to run
- Test URLs (push them through an IPv6 squid and see what happens):
list of test urls:
http://213.151.33.10
http://213.151.33.10/

http://213.151.33.10:8080
http://213.151.33.10:8080/

http://213.151.33.10/path/test1/1.jpg
http://213.151.33.10/path/test1/1.jpg?var1=1&var2=212

http://213.151.33.10:8080/path/test1/1.jpg
http://213.151.33.10:8080/path/test1/1.jpg?var1=1&var2=212

http://213.151.33.10/?var1=1&var2=212
http://213.151.33.10:8080/?var1=1&var2=212

-----

http://www.example.com
http://www.example.com/

http://www.example.com:8080
http://www.example.com:8080/

http://www.example.com/path/test1/1.jpg
http://www.example.com/path/test1/1.jpg?var1=1&var2=212

http://www.example.com:8080/path/test1/1.jpg
http://www.example.com:8080/path/test1/1.jpg?var1=1&var2=212

http://www.example.com/?var1=1&var2=212
http://www.example.com:8080/?var1=1&var2=212

-----

http://test1.example.com
http://test1.example.com/

http://test1.example.com:8080
http://test1.example.com:8080/

http://test1.example.com/path/test1/1.jpg
http://test1.example.com/path/test1/1.jpg?var1=1&var2=212

http://test1.example.com:8080/path/test1/1.jpg
http://test1.example.com:8080/path/test1/1.jpg?var1=1&var2=212

http://test1.example.com/?var1=1&var2=212
http://test1.example.com:8080/?var1=1&var2=212

-----
tcp://host:port
tcp://host:90

tcp://host:90/dddd

----- IPV6 tests
http://[2001:41c8:20::5002]:443/
https://[2001:41c8:20::5002]/
http://www.internetsociety.org/sites/default/files/styles/homepage_highlight/public/field/homepage_highlight/GIRGRAPHICHOMEPAGE_0.jpg?itok=9qszdnEV&c=a6640bf3b0548b29e3059d6ef86fbc16
-- END OF test url links

[X] There is a need to test if squid sends port 
- it sends ip:port or domain:port

- Change to use "net.SplitHostPort" instead of manually parsing, and use the error to decide how to set the host and port.

[X] NEED to test hammer version 7 
[X] Squid Conf example for the client

* A list of http interfaces
[ ] /control (auth)(auth can be done using a reverse proxy)
[X] /control/dunno_mode 
[X] /control/dunno_filp
[X] /control/db_stop
[X] /control/db_start
[X] /control/db_status

- UI path
[ ] /ui/

- SquidGuard search algorithm, as if there is one big blocklist with both domains and paths
[X] /sg/domain 
[ ] /sg/path_only
[ ] /sg/url
[ ] /sg/url_01

[ ] /meraki/url (one big blacklist)
[ ] /meraki/url_01(mixed white and blacklist)

* Need to fix the issues with a trailing "/" for a couple of test cases
[X] /sb/01/url (needs a better testing)
[X] /sb/01/tcp
[ ] /sb/url_nolist
[ ] /sb/url_nolist_01
[ ] /sb/urlwithdomlist_path
[ ] /sb/urlwithdomlist_path_01
[ ] /sb/dom_bl_01
[ ] /sb/tcp_ip_port_01 (uses the "ip:port" format or can be used with a CONNECT scheme from a uri and the DB storage can be tcp:ip:port)
[ ] /sb/safe_search_force/url/

[ ] /sb/weight_url_by_bar
[ ] /sb/weight_url_by_user (bar stored in the db under user_weight: prefix)
[ ] /sb/weight_url_by_group (var bar stored in the db under group_weight: prefix)
[ ] /sb/weight_dom_by_bar
[ ] /sb/weight_dom_by_user (bar stored in the db under user_rate: prefix)
[ ] /sb/weight_dom_by_group (var bar stored in the db under group_rate: prefix)

[ ] /sb/first_match_path1
[ ] /sb/first_match_path2
[ ] /sb/youtube/id/ (block or allow by video or image id)
[ ] /sb/youtube/user/


- The /db/get* interfaces take the key as a query term
[X] /db/get (by key),(url.Unescape by default)
[X] /db/get_base64 (by key)

- All the /db/put* interfaces take val and prefix query terms, and need debug sections.
[X] /db/put (url.Unescape by default)
[X] /db/put_batch (with and without prefix)

[X] /db/set (url.Unescape by default)
[ ] /db/set_dom (url.Unescape by default)
[ ] /db/set_user (url.Unescape by default)
[ ] /db/set_url
[ ] /db/set_uri
[X] /db/set_base64

[X] /db/del (url.Unescape by default)
[X] /db/del_batch (is there prefix compatibility?)
[X] /db/del_base64
[ ] /db/distribute_key/tohost
[ ] /db/fullsync/tohost (should be done with rsync on dunno mode)
[ ] /db/peers

- Tests to run
/sg/X
[X] domain
[ ] /sg/path_only_base64

/db/X (a client sketch follows this list)
[ ] insert key + val using (put or set)
[ ] fetch key (using get)
[ ] del key
[ ] put batch
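
A sketch of the first three /db steps as a small Go client, using the key and val query terms shown in this document; the HTTP methods and the response handling are assumptions (put batch is sketched in the hammer section above):

package main

import (
    "io"
    "log"
    "net/http"
    "net/url"
    "strings"
)

const server = "http://127.0.0.1:8080"

// call issues one request against the DB server and returns the body.
func call(method, path string) string {
    req, err := http.NewRequest(method, server+path, strings.NewReader(""))
    if err != nil {
        log.Fatal(err)
    }
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    return strings.TrimSpace(string(body))
}

func main() {
    key := url.QueryEscape("dom:example.com")
    // Insert the key with a value, fetch it back, then delete it.
    log.Println("put:", call(http.MethodPut, "/db/put/?key="+key+"&val=1"))
    log.Println("get:", call(http.MethodGet, "/db/get/?key="+key))
    log.Println("del:", call(http.MethodDelete, "/db/del/?key="+key))
}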

TODO list

 [X] add sysconfig options for the sbhub
 [X] fix the sbhub default server listen to localhost:8081
 [X] fix the sbhub default server peer to localhost:8080
 [X] fix the nasty bug of the hub closing before fetching the request.
 [X] Adding another point to the TCP search with "*:port" before "*"
 -[ ] Gather route policies for intercepting port 80
  -[ ] Cisco
  -[ ] Juniper
  -[X] Vyos
  -[ ] Mikrotik
  -[ ] EdgeOS (Ubiquiti, should be similar to Mikrotik)
  -[X] Linux
  -[ ] FreeBSD PF(will not be implemented using ipfw)
  -[ ] Others()

License

Copyright (c) 2015, Eliezer Croitoru All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.