Helpful Information
 
 
Category: Server side development
Special Character Formatting

Does anyone know how to strip Word or other word processing formatting from an HTML text area? I can use the Replace function for each ASCII character but there has to be an easier way.

Whenever I copy text from an HTML page or similar, I usually paste it into Notepad first and then copy it back to the clipboard from there.

This strips unwanted formatting for me.

Hope this helps.

I actually do the same thing JoeP does... seems the quickest way to me without stripping stuff you don't want.

Unless you need to clean up the text dynamically when retrieving it from a file, in which case you will want to use the Replace() function. If that is the case, I agree with Dave: do you have any examples?

Here's the basic idea though:



myString = Replace(Replace(Replace(Replace(myString,"<",""),">",""),chr(34),""),chr(39),"")


That removes every <, >, " and ' from myString by replacing each one with an empty string.

:)

I'm trying to strip Word formatting from a cut and paste before it gets to the database. The formatting could be tabs, symbols, international alphabet characters, etc.: anything that could be cut and pasted into a text area from a word processor.
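For the most common offenders (Word's "smart" punctuation), one option is to map them to plain equivalents before saving rather than dropping them. This is only a rough sketch: NormalizeWordPunctuation is a made-up name, the list of characters is nowhere near complete, and it assumes the form data reaches the script correctly decoded as Unicode, which depends on the page's code page settings.

```vbscript
' Hypothetical helper: swaps a few of Word's typographic characters
' for plain ASCII equivalents. The list is illustrative, not exhaustive.
Function NormalizeWordPunctuation(ByVal rawText)
    Dim result
    result = rawText
    result = Replace(result, ChrW(8216), "'")     ' left single quote
    result = Replace(result, ChrW(8217), "'")     ' right single quote / apostrophe
    result = Replace(result, ChrW(8220), Chr(34)) ' left double quote
    result = Replace(result, ChrW(8221), Chr(34)) ' right double quote
    result = Replace(result, ChrW(8211), "-")     ' en dash
    result = Replace(result, ChrW(8212), "-")     ' em dash
    result = Replace(result, ChrW(8230), "...")   ' ellipsis character
    result = Replace(result, vbTab, " ")          ' tabs become single spaces
    NormalizeWordPunctuation = result
End Function
```

You would call it on the text area value before the insert, e.g. cleanText = NormalizeWordPunctuation(Request.Form("comments")), where "comments" is just a placeholder field name.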

Instead of trying to identify all the unwanted characters, as in the Replace() example above, identify the ones you do want. It's much easier to define what you want than to define every possible character that you don't want.

Yeah... maybe instead of using Replace(), you could also just use a regular expression that contains the characters that are acceptable to you, and match the whole string against that, like:

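Set myRegExp = New RegExp

With myRegExp
    ' Anchored so the WHOLE string must be word characters and whitespace
    .Pattern = "^[\w\s]*$"
    .IgnoreCase = True
    .Global = True
End With

' Test() returns False if the string contains any other character
If myRegExp.Test(MyString) = False Then
    myStringError = True
End If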

I haven't tested that...
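If you would rather clean the text than reject it, the same whitelist idea can be flipped around: use a negated character class and replace whatever matches with nothing. A minimal sketch, assuming the VBScript RegExp object is available (VBScript 5.0 or later); StripUnwanted is a made-up name and the set of allowed punctuation is only an example, so adjust it to whatever you actually want to keep.

```vbscript
' Hypothetical helper: keeps only whitelisted characters and drops the rest.
Function StripUnwanted(ByVal rawText)
    Dim re
    Set re = New RegExp
    ' Negated class: matches anything that is NOT a letter, digit,
    ' whitespace or one of the listed punctuation marks.
    ' The doubled "" is a literal double quote inside a VBScript string.
    re.Pattern = "[^A-Za-z0-9\s.,;:!?'""()-]"
    re.Global = True   ' replace every match, not just the first one
    StripUnwanted = re.Replace(rawText, "")
End Function
```

Note that \s keeps tabs and line breaks; replace it with a plain space in the class if you want those removed as well.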


Or, using another method (not NEARLY as elegant), you could make a string of characters that are acceptable, like:

myAcceptableCharacters = ".|,|A|B|C|D|"

etc...

And loop through the string you're checking to see if the current character is in the string (say using a variable like CurrentCharacter), like:

If InStr(myAcceptableCharacters, CurrentCharacter) = 0 Then MyError = True
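Spelled out, the loop could look something like the sketch below. IsAcceptable is a made-up name, and it takes the list of acceptable characters as a parameter. If you keep the "|" separators in that list, "|" itself will count as acceptable, so a plain string like "ABCD" may be the safer choice.

```vbscript
' Hypothetical helper: returns True only if every character of
' textToCheck appears somewhere in acceptableCharacters.
Function IsAcceptable(ByVal textToCheck, ByVal acceptableCharacters)
    Dim i, currentCharacter
    IsAcceptable = True
    For i = 1 To Len(textToCheck)
        currentCharacter = Mid(textToCheck, i, 1)
        ' InStr returns 0 when the character is not found;
        ' vbBinaryCompare keeps the check case-sensitive.
        If InStr(1, acceptableCharacters, currentCharacter, vbBinaryCompare) = 0 Then
            IsAcceptable = False
            Exit Function   ' no need to look any further
        End If
    Next
End Function
```

Then something like If Not IsAcceptable(MyString, "ABCD.,") Then MyError = True does the check in one line.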

Heh... that's definitely typical "Word" HTML formatting. YECH.

HTML Tidy (or the HTML Tidy plugin that comes with HTML-Kit) claims to strip all of the "Word" formatting from a Word-->HTML page, but from what I've seen it strips almost everything, lol.

I'm not sure how to overcome the obstacle of someone potentially pasting "Word" characters in a textarea, without using a regular expression or function of some sort.

If you're not comfortable using regular expressions, you might be better off letting them know they need to paste from Notepad.









