Only recently I’ve discovered the power of ‘re’, the Python regular expression library. Instead of writing long functions that process text character by character to add or remove stuff, you use re, write an expression in regex that achieves what you want and basta! In a few lines things get done.
For example, the following function removes any HTML tags and escapes the rest of whatever the user types in, which helps guard against Cross Site Scripting:
# Remove html tags and escape the input
import html
import re

def scrapeclean(text):
    # This matches a single opening or closing tag: everything from '<' to the next '>'
    x = re.compile(r'<[^<]*?/?>')
    # Replace with nothing using sub, escape what's left over and return the result, all in one line!
    # (html.escape replaces cgi.escape, which was removed in Python 3.8)
    return html.escape(x.sub('', text))

For full disclosure: I took the compile statement from the following site (I’m not a regex expert).
So you can call this function from anywhere in your Python code and the result will be ‘scraped clean’ of all tags beginning with < and ending with >, plus any ampersands or other special characters get ‘escaped’.
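To make that concrete, here is a small usage sketch. It assumes Python 3 and therefore uses html.escape in place of cgi.escape; the name TAG_PATTERN and the sample strings are mine, purely for illustration.

```python
import html
import re

# Same pattern as in the function above: one tag, from '<' to the next '>'
TAG_PATTERN = re.compile(r'<[^<]*?/?>')

def scrapeclean(text):
    # Strip the tags first, then escape whatever special characters remain
    return html.escape(TAG_PATTERN.sub('', text))

# Tags are dropped and the ampersand is escaped
print(scrapeclean('<b>Fish & chips</b> for <i>two</i>'))  # Fish &amp; chips for two

# A lone '<' is not a tag, so it survives the regex and gets escaped instead
print(scrapeclean('1 < 2'))  # 1 &lt; 2
```

Compiling the pattern once at module level, rather than on every call, is just a small tidy-up and does not change the result.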
YMMV: this is very likely not complete protection against everything a hacker can type into your website, but it’s certainly a start.
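As one illustration of why the tag-stripping regex alone would not be enough, here is a hedged sketch with a made-up nested input. Again it assumes Python 3 (html.escape instead of cgi.escape), and the variable names are hypothetical.

```python
import html
import re

# Same pattern as in the function above
TAG_PATTERN = re.compile(r'<[^<]*?/?>')

# A tag hidden inside another tag: removing the inner tags glues the outer pieces together
tricky = '<<script>script>alert(1)<</script>/script>'

stripped = TAG_PATTERN.sub('', tricky)
print(stripped)               # <script>alert(1)</script>

# The escape step is what actually neutralises the leftover markup
print(html.escape(stripped))  # &lt;script&gt;alert(1)&lt;/script&gt;
```

So in this case it is the escaping, not the regex, that keeps the output harmless, which is a good reason to keep both steps together.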