Forums - Regular Expressions

Forums (http://www.abandonia.com/vbullet/index.php)

- Programming (http://www.abandonia.com/vbullet/forumdisplay.php?f=25)

- - Regular Expressions (http://www.abandonia.com/vbullet/showthread.php?t=8946)

Me neither. There's an excellent tutorial on finite state automate on this website though. And check out the wikipedia links on Pushdown automaton. Might get you further along the way. It seems a bit over the top though for a simple pattern matching operation, or am I being naive here?

It's a bit more than pattern matching. It's case insensitive pattern replacing with filling up lacking parts of a pattern. Let me explain it more concrete, with the example of a guestbook.

In the guestbook, users can use BBcode (the []-tags you can use on forums as well). It's quickly done by using str_replace(), which'd replace all BBcode tags with their corresponding HTML in a given string (which's the user's input, in this case).

Now take this case:

Code:

User_one: Foo [b]bar

If I'd use str_replace(), all text after the tag'd be bold, until I'd accidentally use a [/b] somewhere. User_two, User_three and User_one's next post all'll be bold.

What the guestbook should do instead, is check if there're less closing tags for each tag than opening tags, then add the closing tags at the end of the input, so that it'll look like this (in a case of bad user input):

Code:

User_one: Foo [b]bar [b]Foo[b]bar[/b][/b][/b]

After that, I can do a str_replace on the tags (still's a bit more complicated for url-tags, but that still's possible to do. Already got it to work)

If that works, it'll be good, but it'll still not catch all cases of input. It'd miss this, for example:

Code:

User_one: Foo[B]bar [b]Foo[b]bar[/b][/b]

It'll add only two closing-tags at the end, for the simple reason that it won't match the capital B. For simple, one-letter tags, it's still doable, though. In cases like three-letter tags, like the URL-one, it's too cumbersome.

That's what the problems and purposes are. It's more than just a pattern matching, due to the possible adding and the definite replacing. Regular expressions seem to be providing horrible code for this. That or I didn't code it well :angel:

Quote:

Originally posted by Reup@Jan 26 2006, 04:54 PM
Might get you further along the way. It seems a bit over the top though for a simple pattern matching operation, or am I being naive here?

Naive. One of those methods (FSM, PDA, LBA, etc) is standard in any kind of parsing. The key idea here is what most people don't seem to understand: parsing != pattern matching. This is the primary reason why most BBCode parsers and other such small HTML-related parsers (sanitizers, etc) are so fragile and error-prone: because the developers simply assumed that using some pattern matching and replacement they could kludge together an easy parser.

About a year ago I wrote an extremely forgiving, correcting, (X)HTML parser which validates against an arbitrary DTD (for custom tag support) and supports callback filters for custom rendering of elements. It's *way* more complex than what you need to do basic BBCode parsing and sanitization, but it's based on the exact same idea. However, since my implementation was correcting and supported filters it required the development of a full parse tree, which you shouldn't need. Unless you want to do complex transformations you can probably get away with doing things in-place.

Note: FSMs and PDAs are not the same thing, only similar. The stack is absolutely crucial for an HTML or BBCode parser, which is why a FSM is not appropriate.

Quote:

Originally posted by Kon-Tiki@Jan 26 2006, 05:12 PM
That's what the problems and purposes are. It's more than just a pattern matching, due to the possible adding and the definite replacing. Regular expressions seem to be providing horrible code for this. That or I didn't code it well :angel:

The problem with the case-insensitivity is easily fixed. Perl regular expressions have a bunch of modifiers available for such things. You probably want to be using at least the i and g modifiers (for case insensitive and global matching, respectively).

Another major problem with using regexps is for this is that running them across multiple lines can be a real pain (it's possible, but it complicates things).