Page Created: 7/31/2014   Last Modified: 9/3/2018   Last Generated: 12/10/2018
(This page is an extremely rough draft and is full of all kinds of errors. I will try to improve the documentation over time if I release future versions. Please note that this page was originally generated in HTML. If you are reading this as a text README file inside the source tarball, it will not contain the example hyperlinks.)
ScratchedInTime is a Perl-based FastCGI commenting, contact form, and blogging system with a cryptographic ID, remote monitoring and control, and a knowledge captcha. It integrates with Memcached, Bogofilter Bayesian spam filtering, and XMPP. It can integrate with the ScratchedInSpace static site generator, and the blogging system integrates Textile markup, auto-links CamelCase words and #hashtags, and marks external links. It also provides an Atom feed.
Required Files and Directories (need to be in same folder as comments.pl):
ScratchedInSpace is required for initializing variables and paths, creating spam and comments ring buffers and loading captchas into Memcached. If ScratchedInSpace is not used as a static site generator for the site, it can be used with the "memcached" parameter which only loads Memcached for use by comments.pl.
comment.tmpl - HTML template for displaying comments.
entercomment.tmpl - HTML template for entering comments.
blog.tmpl - HTML template for the blog page.
/comments - Folder outside of web server path for storing raw comments. The config.pl file from ScratchedInSpace must also be copied here.
/bogo - Folder outside of web server path for storing bogofilter database and data files (tmpfs recommended).
- curl (for sending POST commands from command line)
- exiftool (used by publish.sh to strip exif tags)
- rsync (used by publish.sh to upload files to public web server)
- ssh (used to execute commands on public web server)
comments.pl - This single FastCGI script handles both the comment input and display. It loads data into Memcached on the Cache server for use by ScratchedInSpace and OswaldBot and also uses Memcached as a persistent variable store for the FastCGI script.
If a page is named YellowBird, for example, a link to the comment page can be added to that page as follows:
The Comment plugin in ScratchedInSpace can also simplify this process.
The comments page is the same page name ending in "Comments", such as YellowBirdComments and must be created in the /comments folder before the comment link will work.
If the page name is not the name of a comments page but is called "private", such as:
Then a private contact form will be generated instead. The output of this contact form will be stored in a Memcached key named after the PRIVATECOMMENTKEY environment variable, which is set in lighttpd.conf on the lighttpd server. This key holds private comments in memory only, where they can be read by OswaldBot through the ScratchedInTimePlugin.
Edit Page Relay
It also relays requests by the ScratchedInSpace system to prompt its user's browser to open an external editor. If it receives an edit request in the form of
http://servername/comments/comments.pl?edit=ThisIsMyPage it will send a short page (with a MIME content type of application/x-ScratchedInSpace) containing only the page name (i.e. ThisIsMyPage), which cannot be longer than $maxcommentlength set in config.pl minus 8 (to account for the length of the "XForward" flag on remote X-Forwarding style edits). See ScratchedInSpace for more details.
Meta generator tags
By default, the "meta generator" tag within the HTML on the pages is set to ScratchedInTime. These tags can be removed if needed by editing the comment.tmpl, entercomment.tmpl, and blog.tmpl template files before running the comment system.
The running FastCGI script is controlled through HTTP POST commands. It is recommended to send these commands only from inside a secure firewall, since they (including the command password) travel as plain text over HTTP and are not safe to send over the Internet.
curl --data "command=[command]&commandpassword=[commandpassword]" http://[url]/comments/comments.pl
The value of the "commandpassword" parameter must match the value stored in the environment variable set in the lighttpd.conf file on the lighttpd server.
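The curl call above can also be made from a script. The following is only a sketch using Python's standard library; the function names are hypothetical, and the host is whatever [url] stands for on your network. Building the request is kept separate from sending it so the form body can be inspected first.

```python
import urllib.parse
import urllib.request


def build_command_request(base_url, command, command_password):
    """Build (but do not send) a POST request with the same two fields as the curl example."""
    # Form-encode the fields; curl --data also sends application/x-www-form-urlencoded.
    body = urllib.parse.urlencode(
        {"command": command, "commandpassword": command_password}
    ).encode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/comments/comments.pl", data=body, method="POST"
    )


def send_command(base_url, command, command_password, timeout=10):
    """Send the command and return the response body as text."""
    request = build_command_request(base_url, command, command_password)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Unlike a raw curl --data string, urlencode escapes characters such as & and spaces, which matters for free-form commands like "blog [blog text]".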
POST Commands can also be sent from OswaldBot.
blog [blog text]
S|C|H|U|B [Comment page] [Comment#1 Comment#2 ...]
Any page in a ScratchedInSpace site can have a comments page, but it first has to be created. So, for example, to add a comments page for YellowBird, a file named YellowBirdComments would need to be added to the /comments folder. Each comment file has to have the same name as the page on which it is commenting, ending in "Comments".
The easiest way to do this at a terminal is to send a command directly to the server over ssh, such as: ssh [server] "touch /comments/YellowBirdComments".
Then the Comments server must be told to generate the HTML page by sending a "generatecomments" command to the server, which generates the non-spam pages and blog and also adds them to Memcached.
Now the empty comments page has been created, but there is no link to that page, so nobody will be able to find it.
On the static page, the ScratchedInSpace Comment macro needs to be added which will create a link to this comments page.
When a person leaves a comment, it is saved to the appropriate Comments page in the /comments folder.
If anything happens to the /comments folder, the comments are lost. So periodically, it is recommended to run something like rsync or rdiff-backup to transfer backups of this folder to another computer.
A knowledge captcha was implemented instead of a visual one. The questions and answers are stored in a file called "captcha.data" which sits in the same directory as ScratchedInSpace.pl. When ScratchedInSpace.pl is run, it loads Memcached with the questions and answers. Then comments.pl randomly picks one for each comment.
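As a rough illustration of the pick-one-per-comment idea, here is a sketch in Python. A plain dict stands in for Memcached, the questions are made up, and the token and key names are hypothetical; the real format of captcha.data is not shown here.

```python
import secrets

# Hypothetical question/answer pairs; the real ones come from captcha.data.
CAPTCHAS = {
    "What color is a clear daytime sky?": "blue",
    "How many legs does a spider have?": "8",
}

cache = {}  # stand-in for Memcached


def issue_captcha():
    """Pick a random question and remember its answer under a one-time token."""
    question = secrets.choice(list(CAPTCHAS))
    token = secrets.token_hex(16)
    cache["captcha-" + token] = CAPTCHAS[question]
    return token, question


def check_captcha(token, answer):
    """Return True only for the right answer on the first use of the token."""
    # pop() removes the entry, so a second submission with the same token fails.
    expected = cache.pop("captcha-" + token, None)
    return expected is not None and answer.strip().lower() == expected
```

Removing the stored answer on the first check is one simple way to make a reused captcha fail.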
Security checks are performed which look for:
- Form incomplete. Name or Comment is blank.
- Name too long. It can only be up to $maxcommentlength.
- Page name too long (ScratchedInSpace edit page names). It can only be up to $maxcommentlength.
- Passcode too short. It can't be shorter than the random passcode generated.
- Spaces found in name. The system allows first names only.
- Wrong answer. The answer to the captcha is incorrect. This can also occur if someone attempts to reuse the same captcha after a submission.
- Bad IP. The IP address used when the form was generated is not the same IP address that submitted the form.
- Missing key. A random session token is generated at the time the comment form is generated. The received comment must include this token.
- Bad parameter. The field in the HTTP query string is invalid.
After a successful submission, a tarpit routine is activated which prevents further comments from the same IP address from being entered for 60 seconds.
The tarpit keeps track of the time in Memcached, a way of keeping persistent variables outside of FastCGI. The IP address entry expires, so no personally identifiable information is kept for very long, and it is not written to disk (unless someone writes it in their comment).
- Page full. The max limit of comments on that page has been reached.
- Page locked. The comments page has been locked by the administrator.
- Names and comments are stripped of any non-English letters except for , . ! ? - which prevents any URLs from being directly entered.
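The 60-second tarpit described above can be sketched with an expiring key per IP address. This is a Python illustration only: ExpiringStore is a small in-memory stand-in for Memcached's time-to-live behavior, and the "tarpit-" key prefix is an assumption.

```python
import time

TARPIT_SECONDS = 60


class ExpiringStore:
    """Tiny in-memory stand-in for Memcached keys with a time-to-live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}

    def set(self, key, value, ttl):
        self._items[key] = (value, self._clock() + ttl)

    def get(self, key):
        value, expires_at = self._items.get(key, (None, 0.0))
        if self._clock() >= expires_at:
            self._items.pop(key, None)  # expired entries are forgotten
            return None
        return value


def may_comment(store, client_ip):
    """True if this IP address is not currently in the tarpit."""
    return store.get("tarpit-" + client_ip) is None


def record_submission(store, client_ip):
    """Start the tarpit window for this IP address after a successful comment."""
    store.set("tarpit-" + client_ip, 1, TARPIT_SECONDS)
```

Because the entry simply expires, nothing has to clean up old IP addresses afterwards.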
After incoming comments are received and sanitized, they are passed through Bogofilter. Bogofilter runs on the underlying Arch Linux system, and the script calls it. Bogofilter is directed to use the /bogo ramdisk to save its Berkeley DB database, the spam and ham "corpus", and its temp file.
The "primebogo" command deletes the database and re-primes bogofilter with the latest spam and ham, creating a new database.
Since it is already touching all comments pages to rebuild the spam database, it also deletes and recreates the "NEWUSER-" keys in Memcached and reloads them with new names and IDs to rebuild the Memcached newuser flag database.
Comments are classified as one of the following:
- S - Spam. Marked as Spam but not added to spam corpus
- C - Spam Corpus. Marked as spam by the administrator.
- H - Ham. Marked as ham and added to corpus
- U - Unclassified. Bogofilter never made a decision. Not added to corpus.
- B - Blocked. Is not spam, but removed by the administrator for some reason.
Primebogo first searches all Comments pages and separates them, saving the spam corpus (C) in the bogospamcorpus.data file and the ham (H) in the bogonospam.data file on the ramdisk. It also counts the number of good messages per page.
It then rebuilds the bogofilter database using these data files.
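The separation step can be pictured roughly as follows. This Python sketch assumes each comment has already been parsed into a (page, classification, text) tuple, which is a hypothetical representation, and it assumes "good" means H or U (the classifications that stay visible).

```python
from collections import Counter


def split_corpora(comments):
    """Split (page, classification, text) tuples into corpora and per-page counts.

    Returns (spam_corpus, ham_corpus, good_counts).
    """
    spam_corpus, ham_corpus = [], []
    good_counts = Counter()
    for page, classification, text in comments:
        if classification == "C":      # spam confirmed by the administrator
            spam_corpus.append(text)
        elif classification == "H":    # ham
            ham_corpus.append(text)
        # Assumption: a "good" comment is one that stays visible, i.e. H or U.
        if classification in ("H", "U"):
            good_counts[page] += 1
    return spam_corpus, ham_corpus, good_counts
```

Comments marked S or B contribute to neither corpus and are not counted.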
The system temporarily stores a single new comment in bogotempdatabase.data so it can feed it to bogofilter to classify it.
When bogofilter classifies it, if it is Spam, the system will mark it as spam but not add it to the corpus, and if it is Ham, the system will add it to the corpus. If it is Unclassified, the system won't add it to the corpus but will still publish it. This means that spam wrongly classified as Ham (a false negative) is added to the ham corpus and creates a feedback effect on the Bayesian filter unless the good comments are monitored and spam that gets through is marked as such. But this is better than false positives, where people's comments are blocked and are never seen.
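The decision table above can be written down compactly. In this Python sketch the exit statuses (0 spam, 1 non-spam, 2 unsure) and the -d option are from the bogofilter manual page; feeding the comment on standard input instead of through bogotempdatabase.data is an assumption made to keep the example short, and the ACTIONS table is only a restatement of the paragraph.

```python
import subprocess

# Exit statuses documented in the bogofilter manual page.
EXIT_TO_CLASS = {0: "S", 1: "H", 2: "U"}

# What happens to a freshly classified comment, per the description above.
ACTIONS = {
    "S": {"publish": False, "add_to_corpus": False},
    "H": {"publish": True, "add_to_corpus": True},
    "U": {"publish": True, "add_to_corpus": False},
}


def classify(comment_text, bogo_dir="/bogo"):
    """Feed one comment to bogofilter on stdin and map its exit status to S, H or U."""
    result = subprocess.run(
        ["bogofilter", "-d", bogo_dir],
        input=comment_text.encode("utf-8"),
        stdout=subprocess.DEVNULL,
    )
    if result.returncode not in EXIT_TO_CLASS:
        raise RuntimeError("bogofilter failed with status %d" % result.returncode)
    return EXIT_TO_CLASS[result.returncode]
```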
After bogofilter classifies a comment, it is added to the comments page, but only the non-spam comments are rendered to the public web server and added to Memcached so the Static Nginx server can pick it up.
Any spam is added to a circular "ring buffer" created using Memcached.
The Spam Ring Buffer
To monitor incoming spam (to make sure bogofilter is working), the most recent spam is added to Memcached, and the oldest spam is removed once the buffer exceeds a certain size. This keeps memory usage low and makes it easier to manage. Instead of shifting entries around in memory, a pointer moves in a ring and simply writes the new spam into the next Memcached slot (like moving a pointer around in an array).
Using Memcached for this took a burden off the CPU of the comments server, since recent spam does not have to be generated from the pages. The ringbuffer only works for new spam and is not regenerated if Memcached gets overwritten or goes down. It is very ephemeral, which is how it should be, since it is just spam, unless there is an issue. It also makes it very easy for OswaldBot to access it.
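A ring buffer of this kind needs only a pointer key and a fixed set of slot keys. The Python sketch below uses a dict in place of Memcached; the key names and the ring size of 5 are hypothetical.

```python
RING_SIZE = 5  # hypothetical; the real limit is a configuration detail

cache = {}  # stand-in for Memcached


def add_spam(text):
    """Store text in the slot under the pointer, advance the pointer, return the slot."""
    slot = cache.get("spam-pointer", 0)
    cache["spam-%d" % slot] = text            # overwrites the oldest entry when full
    cache["spam-pointer"] = (slot + 1) % RING_SIZE
    return slot


def show_spam():
    """Return (slot, text) pairs for every occupied slot."""
    return [
        (i, cache["spam-%d" % i])
        for i in range(RING_SIZE)
        if "spam-%d" % i in cache
    ]
```

The slot number returned here plays the role of a position in the ring; nothing is ever shifted, only overwritten.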
To view this ring buffer, the "showmespam" command is used, which displays the spam as XMPP on an XMPP client such as a mobile phone, numbering them with their "REZNUMBER", their location in the ringbuffer.
If the ring buffer spam includes a good non-spam comment (ham) that bogofilter incorrectly classified, an "R [REZNUMBER]" command should be sent to comments.pl to resurrect that message as Ham. It also adds it to the Ham corpus.
The latest non-spam comments are generated as an Atom feed (recentcomments.atom). This allows a feed reader to quickly view them. A "generatefeed" command will render a new feed from the latest good comments stored in their own ring buffer in Memcached. It does not pull the latest good comments from disk at all, using the ring buffer to take the burden off the CPU and disk.
If a comment in the Atom feed or on a page should not be visible or is spam, a "[classification] [pagename] [comment#1] [comment#2] ..." command can be sent to reclassify the comment with the new classification.
So "S YellowBirdComments 4 5 6" would reclassify comments 4,5, and 6 on the YellowBirdComments page as Spam, but not add them to the corpus. They would be immediately removed from the page.
When spam is removed from a Comments page, it doesn't completely disappear but the system leaves an invisible placeholder in case the comment needs to be put back, so the comments retain their chronological order.
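To make the command shape concrete, here is a small Python sketch that splits a reclassification command into its parts. The function name and error messages are hypothetical, and the real script may validate differently.

```python
VALID_CLASSES = {"S", "C", "H", "U", "B"}


def parse_reclassify(command):
    """Split "S YellowBirdComments 4 5 6" into (classification, page, [numbers])."""
    parts = command.split()
    if len(parts) < 3 or parts[0] not in VALID_CLASSES:
        raise ValueError("expected: [classification] [pagename] [comment#] ...")
    classification, page = parts[0], parts[1]
    if not page.endswith("Comments"):
        raise ValueError("page name must end in 'Comments'")
    numbers = [int(n) for n in parts[2:]]  # raises ValueError if not numeric
    return classification, page, numbers
```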
Comment count and Lock status
At the top of each Comments page is the comment count. To the right is the lock status.
So for example, a comment page may have at the top "100 LOCKED" which means there are 100 good comments and the administrator lock has been applied. The reason the count is stored on the page is because the page holds all comments (spam and non-spam). It needs to keep the spam to use as a spam corpus, if the bogofilter database needs to be regenerated. And fetching and incrementing this number after each new good comment was a way to prevent the server from having to sort and count the comments each time. This count is displayed on the comment page for viewing, and allows a max comment limit to be added to that page, instead of using file size limits. A file size limit would not be accurate if most of the comments were spam, eating up all the space, with few good comments, so a max comment limit seems better.
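Reading the count and lock status back is then a one-line parse. This Python sketch assumes the first line looks like "100 LOCKED" when locked and like a bare number otherwise; the unlocked form is a guess, as are the function names.

```python
def parse_header(first_line):
    """Return (count, locked) from a first line such as "100 LOCKED"."""
    fields = first_line.split()
    count = int(fields[0])
    locked = len(fields) > 1 and fields[1] == "LOCKED"
    return count, locked


def can_accept(first_line, max_comments):
    """Return None if a new comment is allowed, else the name of the failed check."""
    count, locked = parse_header(first_line)
    if locked:
        return "Page locked"
    if count >= max_comments:
        return "Page full"
    return None
```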
If a "blog [blogtext]" command is sent, the blog text is passed to comments.pl to create an instant blog entry on the blog page. The Textile lightweight markup language can be used. If a word is typed in CamelCase or #hashtag format, it will auto-link to the relevant page on the static site. Note that all hashtags on the blog are non-anchored, which is slightly different behavior than in ScratchedInSpace.
If an external link is written in the blog using Textile markup, a ↗ appears to the right of the link. This is a northeast arrow icon (diagonal) which corresponds to unicode U+2197.
The blog entries also contain hidden HTML anchors that correspond to the visible time and date stamp, with spaces and colons converted to underscores. This allows linking to those blog entries from the static site. For example, if the blog timestamp is "Fri Aug 1 05:18:53 2014", this entry can be easily hyperlinked by using http://servername/MyBlog#Fri_Aug__1_05_18_53_2014 as the URL. If using the ScratchedInSpace generator for the static site, it includes a Bloglink plugin that will automatically convert the spaces and colons to underscores and link to the correct page, making it a simple matter of copying/pasting the date/time value. The problem with manually creating a hyperlink is that if the blog is archived after a year passes, for example if MyBlog is ever renamed to MyBlog2015, the link will break. The Bloglink plugin will automatically account for this, and will assume that MyBlog is the blog for the current year, and will update the link for past years accordingly.
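The anchor conversion is a single substitution. In this Python sketch, note that the timestamp has two spaces before a single-digit day (as in C's ctime output), which is why the example URL above contains a double underscore; the function name is hypothetical.

```python
import re


def blog_anchor(timestamp):
    """Convert a blog timestamp into its anchor by replacing spaces and colons."""
    return re.sub(r"[ :]", "_", timestamp)


# Two spaces before the "1", as the day of the month is space-padded.
print(blog_anchor("Fri Aug  1 05:18:53 2014"))
```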
Cryptographic ID system
Made possible due to modern cryptography, if a person wants to indicate that they are the same person that left a previous comment, they can write down the passcode they previously used and use it again. See CommentSystem for more info.
When a comment is created, 3 random bytes (24 bits) are read from the /dev/urandom random number generator and converted to a 4-character base64 string. This is the passcode that is shown on the comment form. This means there are 2^24 or 64^4 possible values, or 16,777,216. It is very unlikely that two different people with the exact same name will have the same 4-character code.
This passcode is combined with their name and a 512-byte "pepper" to generate a 512-bit hash or digest using the SHA3 (Keccak) algorithm.
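The passcode step is easy to reproduce. This Python sketch assumes the standard base64 alphabet; os.urandom reads from the operating system's random source (/dev/urandom on Linux), and the optional argument exists only so the encoding can be checked with known bytes.

```python
import base64
import os


def new_passcode(random_bytes=None):
    """Return a 4-character base64 passcode built from 3 random bytes."""
    raw = os.urandom(3) if random_bytes is None else random_bytes
    # 3 bytes = 24 bits = exactly 4 base64 characters, with no "=" padding.
    return base64.b64encode(raw).decode("ascii")
```

Three bytes map to four base64 characters exactly, which is why 2^24 and 64^4 are the same number.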
Name + Passcode + Pepper --> SHA3 = HASH
The Sliding Window
Then a 5-character "sliding window" is moved from left to right across this hash, beginning at an offset equal to the length of the name + 5. The 5 characters inside the window are the ID that is displayed.
The pepper is global, unlike a salt, so it has to be secret. For lighttpd, it is loaded as a CGI environment variable by /etc/lighttpd/lighttpd.conf when lighttpd is started (other web servers have different methods of assigning environment variables). This keeps it out of the Perl code in case the code is compromised. The Perl program fetches it from RAM.
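Putting the hash and the window together might look like the Python sketch below. Several details are assumptions, not taken from the Perl source: the inputs are simply concatenated, the digest is base64-encoded before the window is applied, the offset wraps if the name is long, and hashlib's sha3_512 (the standardized SHA-3, which pads differently from the original Keccak submission) stands in for whatever the script uses.

```python
import base64
import hashlib

WINDOW = 5


def visible_id(name, passcode, pepper):
    """Hash name + passcode + pepper and return a 5-character window of the result."""
    digest = hashlib.sha3_512((name + passcode + pepper).encode("utf-8")).digest()
    # 64 digest bytes become 86 base64 characters once the "==" padding is removed.
    encoded = base64.b64encode(digest).decode("ascii").rstrip("=")
    # Start at len(name) + 5; wrapping the offset is an assumption for long names.
    start = (len(name) + WINDOW) % (len(encoded) - WINDOW + 1)
    return encoded[start:start + WINDOW]
```

Because the name is part of the hash input as well as the offset, changing the name changes both the hash and where the window lands.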
If lighttpd is used, something like this needs to be added to /etc/lighttpd/lighttpd.conf (in Arch Linux):
setenv.add-environment = ( "PEPPER" => "[HASH]", )
... where [HASH] is simply the result of "head -c 513 /dev/urandom | base64"
Benefits of this method:
- No password or personally identifiable information is stored unless the person decides to write personal information in the public comment, so a breach doesn't harm anyone.
- A brute force attack on the 4-character passcode is infeasible because:
  - Users don't see the full hash, only a window.
  - Using a different name of the same length with the same key is of no help in getting the entire hash, since the entire hash string changes if the name is different.
  - An attacker can't just rely on the 5-character output, since the server creates a delay before responding, which slows an attack enough to make it infeasible. Rainbow tables cannot be generated.
  - If an attacker did find a collision with a 5-character window, it would probably only work for that one user, and they still would not have the global pepper, which is enormous (512^256, or around 10^693).
  - Even with millions of users and data points, it is infeasible, since 10^693/10^6 is still a huge number.
- A 5-character base64 has 1,073,741,824 possible values, so innocent conflicts are unlikely. If it occurs, the window could be increased to 6-characters.
- If a user's passcode is compromised due to another factor, such as plain text over Internet or someone getting a hold of it, the original person can just flag that ID as compromised. This can't be blocked since there is no password they can reset. That is the beauty of keeping a site simple and without managing accounts and passwords.
- The 4-character passcode is easy to remember and type.
- The 5-character ID is easy to see for visual comparison.
Weaknesses of this method:
- A user can override the random passcode with a weaker, non-random one.
- A pepper breach is global.
- Strange 4 or 5-letter words may randomly appear in passcode.
- There is no protection for passcode over Internet--it is plain text. (Implementing a key-exchange mechanism was not worth it in this case.)
New user flag
When a person leaves a comment and their name and ID have never been used before, the system marks them as " NEW". Instead of searching all the pages in the comments to see if that person's name and ID previously existed, which is CPU and disk intensive, it checks the Memcached NEWUSER-[Name] key for that name to see what IDs it contains (since several people can use the same name but have different IDs). If the ID exists, the person is not new. If the ID doesn't exist, the new ID is added to that key and the person is considered new.
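The lookup can be sketched in a few lines of Python. A dict of sets stands in for Memcached here; how the real script packs several IDs into one Memcached value is not described, so that part is an assumption.

```python
cache = {}  # stand-in for Memcached


def is_new_user(name, visible_id):
    """Return True (and remember the ID) if this name/ID pair has not been seen."""
    known_ids = cache.setdefault("NEWUSER-" + name, set())
    if visible_id in known_ids:
        return False
    known_ids.add(visible_id)
    return True
```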
To load this new user information into Memcached, the "primebogo" command will do it, since the newuser load just piggybacks on it. It wasn't worth it to write a separate function for it, since primebogo was already reading all the comment pages.
When a person's passcode is compromised, they can flag it as compromised, and all of their comments from that time and earlier are marked as " COMPROMISED". It doesn't stop people from continuing to use the name and passcode after that time, but it at least creates an indicator to alert others that the comments from that time and before cannot be trusted as being from the same person.
If the person is a New user, someone cannot mark the account as compromised, since that doesn't make sense as there is only one comment.
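One way to picture the flagging rule is the Python sketch below. The comment records (dicts with name, id, time and flag fields) are a hypothetical representation, and treating "new user" as "fewer than two matching comments" is an assumption.

```python
def mark_compromised(comments, name, visible_id, flagged_at):
    """Flag this person's comments made at or before flagged_at; return how many."""
    matching = [c for c in comments if c["name"] == name and c["id"] == visible_id]
    if len(matching) < 2:
        return 0  # assumption: a new user has only one comment, so nothing to flag
    marked = 0
    for comment in matching:
        if comment["time"] <= flagged_at:
            comment["flag"] = "COMPROMISED"
            marked += 1
    return marked
```

Comments made after the flag time are left alone, matching the note that the name and passcode can still be used afterwards.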
OswaldBot integration
Page Created: 7/23/2014   Last Modified: 3/11/2016   Last Generated: 12/10/2018
ScratchedInTimePlugin.py - Needs to be in the same folder as OswaldBot.
ScratchedInTime doesn't need a control server, but I added a plugin called ScratchedInTimePlugin so that OswaldBot can send certain commands to it so I could send text commands (XMPP) from a smart phone running Xabber to the system to do things like send blogs, view most recent spam, view private comments, mark certain comments as spam, resurrect spam as valid comments, rebuild comments pages, and prime the bogofilter database.
Since I didn't add a user moderation system, spam is only controlled algorithmically unless I intervene. I could build a moderation system, but that would just add complexity, which I am keeping to a minimum. But a manual method only works if you put upper limits on the amount of comments per page. Otherwise the amount of spam would eventually overwhelm a single person.
OswaldBot is not directly accessible by the Comments server for security reasons. To receive output from comments.pl, it checks Memcached running on the Cache server. It is a little tricky using the same Memcached server from both Perl and Python, since the client implementations are different, which can cause problems in Memcached if you're not careful.
set [command] - Sends a free form POST command to the comments server.
showmespam - Shows the latest spam flagged, along with the "resurrection" number.
showmeprivate - Shows the latest private comments (contact form), appends them to a file, and deletes them from Memcached.
The POST commands normally have to be prefixed by the word "set", except for "showmespam" and "showmeprivate" which were added for convenience.
There are probably all kinds of bugs in it.
Warning, this project is experimental and not recommended for real data or production. Do not use this software (and/or schematic, if applicable) unless you read and understand the code/schematic and know what it is doing! I made it solely for myself and am only releasing the source code in the hope that it gives people insight into the program structure and is useful in some way. It might not be suitable for you, and I am not responsible for the correctness of the information and do not warrant it in any way. Hopefully you will create a much better system and not use this one.
I run this software because it makes my life simpler and gives me philosophical insights into the world. I can tinker with the system when I need to. It probably won't make your life simpler, because it's not a robust, self-contained package. It's an interrelating system, so there are a lot of pieces that have to be running in just the right way or it will crash or error out.
There are all kinds of bugs in it, but I work around them until I later find time to fix them. Sometimes I never fix them but move on to new projects. When I build things for myself, I create structures that are beautiful to me, but I rarely perfect the details. I tend to build proof-of-concept prototypes, and when I prove that they work and are useful to me, I put them into operation to make my life simpler and show me new things about the world.
I purposely choose to not add complexity to the software but keep the complexity openly exposed in the system. I don't like closed, monolithic systems, I like smaller sets of things that inter-operate. Even a Rube Goldberg machine is easy to understand since the complexities are within plain view.
Minimalism in computing is hard to explain; you walk a fine line between not adding enough and adding too much, but there is a "zone", a small window where the human mind has enough grasp of the unique situation it is in to make a difference to human understanding. When I find these zones, I feel I must act on them, which is one of my motivating factors for taking on any personal project.
Here is an analogy: you can sit on a mountaintop and see how the tiny people below build their cities, but never meet them. You can meet the people close-up in their cities, but not see the significance of what they are building. But there is a middle ground where you can sort of see what they are doing and are close enough to them to see the importance of their journey.
The individual mind is a lens, but, like a single telescope looking at the night sky, we can either see stars that are close or stars that are much farther away, but we can't see all stars at the same time. We have to pick our stars.
I like to think of it like this:
Source code can be downloaded here.