Syntax highlighter for C++

Discussion in 'Web Design & Coding' started by X-Istence, Nov 13, 2008.

  1. X-Istence

    X-Istence * Political User

    Messages:
    6,498
    Location:
    USA
    Please see this later post of mine for the most up-to-date sed script: http://forum.osnn.net/showpost.php?p=856765&postcount=8

    --

    I am a sick, sick person, and I love regexes :p. I have written a rather quick and dirty C++ syntax highlighter that does a few things (I am using it to create my portfolio website). It is a sed script that goes through the C++ files it is handed and wraps comments, pre-processor lines, strings and keywords in HTML spans.

    It also does some other cool things. For example, I wanted to embed links in my C++ source files so that they come out as real links on the web page, so I took a page from the Markdown idea.

    Code:
    {Text here}[url here]
    This syntax lets the script turn text embedded in the C++ source files into links. I figured I'd post it here.
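    As a rough illustration of the link rule (the last substitution in the script), here is a minimal sketch in Python rather than sed, so that it can be run on its own. The function name `embed_links` and the example.com URL are made up for the example, and it emits a `title` attribute where the script writes `alt`; treat it as a sketch under those assumptions, not as the script itself.

```python
import re

# Same idea as the sed rule for {Text here}[URL here]:
# group 1 is the link text, group 2 is the URL.
# [^}] and [^\]] keep each part from running past its own closing bracket.
LINK = re.compile(r"\{([^}]*)\}\[([^\]]*)\]")


def embed_links(line: str) -> str:
    """Turn every {text}[url] pair on a line into an HTML anchor."""
    return LINK.sub(r'<a href="\2" title="\1">\1</a>', line)


# Hypothetical input line; the URL is a placeholder.
print(embed_links("// See {the manual}[http://example.com/manual] for details"))
```

    Lines without a `{text}[url]` pair pass through unchanged, which is why the script can safely apply this rule to comments as well.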

    Code:
    #!/usr/bin/sed -E -f 
    
    # Remove anything HTML is going to hate us for
    s/</\&lt;/g
    s/>/\&gt;/g
    
    # Multiline comments
    
    /\/\*\*/,/\*\*\//{
    
    	# Replace the first instance with a span
    	/\/\*\*/c\
    	<span class="comment">/**
    
    	# Close the span
    	/\*\*\//c\
    	**/</span>
    
    	# We parse URL's in comments, but nothing else!
    	b url
    }
    
    /\/\// {
    	s/(\/\/ .*)/<span class=\"comment\">\1<\/span>/
    	
    	b url
    }
    
    # Pre-processor
    /^#.*/ {
    	# We add a span to syntax highlight it
    	s/(#.*)/<span class=\"preproc\">\1<\/span>/
    	
    	# We want to make links out of the headers we have written
    	s/#include \"(.*)\"/#include "<a href=\"\.\/\1tml\" alt=\"\1\">\1<\/a>"/
    	
    	# Only process URL's, nothing else!
    	b url
    }
    
    # Replace text within quotes
    s/\"([^"]*)\"/\"<span class=\"text\">\1<\/span>\"/g
    
    # We syntax highlight the standard library stuff (not sure if I want to turn this on)
    
    # s/(std::[^ (]+)/<span class=\"keyword\">\1<\/span>/g
    
    # Types
    s/(int)/<span class=\"keyword\">\1<\/span>/g
    s/(char)/<span class=\"keyword\">\1<\/span>/g
    s/(struct)/<span class=\"keyword\">\1<\/span>/g
    
    # Keywords
    s/(switch)/<span class=\"keyword\">\1<\/span>/g
    s/(case)/<span class=\"keyword\">\1<\/span>/g
    s/(default)/<span class=\"keyword\">\1<\/span>/g
    s/(new)/<span class=\"keyword\">\1<\/span>/g
    s/(delete)/<span class=\"keyword\">\1<\/span>/g
    s/(typedef)/<span class=\"keyword\">\1<\/span>/g
    s/(return)/<span class=\"keyword\">\1<\/span>/g
    s/(public:)/<span class=\"keyword\">\1<\/span>/g
    s/(private:)/<span class=\"keyword\">\1<\/span>/g
    s/(protected:)/<span class=\"keyword\">\1<\/span>/g
    s/(const)/<span class=\"keyword\">\1<\/span>/g
    s/(friend)/<span class=\"keyword\">\1<\/span>/g
    
    # This one is special, if we are not careful we also match class= from all the above <span>'s.
    s/(class )/<span class=\"keyword\">class<\/span> /g
    
    : url
    # Replace URL's {Text here}[URL here]
    s/\{([^\{]*)\}\[([^\[]*)\]/<a href=\"\2\" alt=\"\1\">\1<\/a>/g
     
    Last edited: Mar 6, 2009
  2. LordOfLA

    LordOfLA Godlike!

    Messages:
    7,027
    Location:
    Maidenhead, Berkshire, UK
    That's evil :D

    Care to explain the regexes for those of us not versed in them? Also for those of us under the influence of alcohol :)
     
  3. X-Istence

    X-Istence * Political User

    Do I have to?

    Maybe later, I am working on some stuff.
     
  4. Geffy

    Geffy Moderator Folding Team

    Messages:
    7,805
    Location:
    United Kingdom
    Which ones? Most of them aren't that complicated. You can always read regular-expressions.info to get a better grasp of regexes.
     
  5. X-Istence

    X-Istence * Political User

    Code:
    #!/usr/bin/sed -E -f 
    
    # Remove anything HTML is going to hate us for
    s/</\&lt;/g
    s/>/\&gt;/g
    
    # Multiline comments
    
    /\/\*\*/,/\*\*\//{
    
    	# Replace the first instance with a span
    	/\/\*\*/c\
    	<span class="comment">/**
    
    	# Close the span
    	/\*\*\//c\
    	**/</span>
    
    	# We parse URL's in comments, but nothing else!
    	b url
    }
    
    /\/\// {
    	s/(\/\/ .*)/<span class="comment">\1<\/span>/
    	
    	b url
    }
    
    # Pre-processor
    /^#.*/ {
    	# We add a span to syntax highlight it
    	s/(#.*)/<span class="preproc">\1<\/span>/
    	
    	# We want to make links out of the headers we have written
    	s/#include \"(.*)\"/#include "<a href="\.\/\1tml" alt="\1">\1<\/a>"/
    	
    	# Process no more
    	b end
    }
    
    # Literal strings, we want to highlight them, but there is a catch
    # in C++ we are allowed to start a literal string on one line, and then continue it on the next line
    # this means we need to make sure we parse that correctly!
    
    /"/ {
    
    	# Are both the opening and closing quote on the same line? If so, branch to noloop
    	/"[^"]*"/b noloop
    	
    	# No, they are apparently not. This means we replace the quote with the correct span tag
    	s/"([^"]*)/"<span class="text">\1/
    	
    	# Label the loop
    	: loop
    	
    	# Output the text to stdout, as normal
    	n
    	
    	# Did we find another quote character yet? If not, we branch to label loop
    	/"/!b loop
    	
    	# Ah, we found it. Replace it with the correct span tag.
    	s/"/<\/span>"/
    	
    	# Parse only URL's in string literals.
    	b url
    	
    	# Label the noloop branch
    	: noloop
    	
    	# Search and replace the text
    	s/"([^"]*)"/"<span class="text">\1<\/span>"/g
    	
    	# Parse only URL's in string literals.
    	b url
    }
    
    # We syntax highlight the standard library stuff (not sure if I want to turn this on)
    
    # s/(std::[^ (]+)/<span class=\"keyword\">\1<\/span>/g
    
    # Types
    s/(int)/<span class="keyword">\1<\/span>/g
    s/(char)/<span class="keyword">\1<\/span>/g
    s/(struct)/<span class="keyword">\1<\/span>/g
    
    # Keywords
    s/(switch)/<span class="keyword">\1<\/span>/g
    s/(case)/<span class="keyword">\1<\/span>/g
    s/(default)/<span class="keyword">\1<\/span>/g
    s/(new)/<span class="keyword">\1<\/span>/g
    s/(delete)/<span class="keyword">\1<\/span>/g
    s/(typedef)/<span class="keyword">\1<\/span>/g
    s/(return)/<span class="keyword">\1<\/span>/g
    s/(const)/<span class="keyword">\1<\/span>/g
    s/(friend)/<span class="keyword">\1<\/span>/g
    
    s/(public:)/<span class="keyword">\1<\/span>/g
    s/(private:)/<span class="keyword">\1<\/span>/g
    s/(protected:)/<span class="keyword">\1<\/span>/g
    
    # This one is special, if we are not careful we also match class= from all the above <span>'s.
    s/(class )/<span class="keyword">class<\/span> /g
    
    : url
    # Replace URL's {Text here}[URL here]
    s/\{([^\{]*)\}\[([^\[]*)\]/<a href="\2" alt="\1">\1<\/a>/g
    
    : end
    This version is a little more robust, as I added some code to deal with string literals. I know of one edge case that won't be parsed properly; does anyone see it? As a hint, it is in the string literal parsing.
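    The post does not say which edge case is meant. One plausible candidate (my assumption, not confirmed anywhere in the thread) is an escaped quote inside a literal, since a pattern like `"[^"]*"` stops at the first `\"`. A minimal Python sketch of a pattern that tolerates backslash escapes, with the hypothetical name `highlight_strings`:

```python
import re

# A double-quoted literal: between the quotes, accept either a character
# that is neither a quote nor a backslash, or a backslash followed by
# any single character (which covers \" and \\).
STRING = re.compile(r'"(?:[^"\\]|\\.)*"')


def highlight_strings(line: str) -> str:
    """Wrap each single-line string literal in a span with class "text"."""
    return STRING.sub(
        lambda m: '<span class="text">' + m.group(0) + "</span>", line
    )


# The escaped quotes stay inside one literal instead of ending it early.
print(highlight_strings(r'puts("say \"hi\""); // done'))
```

    This only covers literals that start and end on the same line; the multi-line loop in the script would still be needed for continued strings.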
     
  6. X-Istence

    X-Istence * Political User

    So, yeah ... that version had a few flaws in it, which I only noticed as I pushed on towards bigger and better things. Of course.

    Code:
    #!/usr/bin/sed -E -f 
    
    # Remove anything HTML is going to hate us for
    s/</\&lt;/g
    s/>/\&gt;/g
    
    # Multiline comments
    
    /\/\*\*/,/\*\*\//{
    
    	# Replace the first instance with a span
    	/\/\*\*/c\
    	<span class="comment">/**
    
    	# Close the span
    	/\*\*\//c\
    	**/</span>
    
    	# We parse URL's in comments, but nothing else!
    	b url
    }
    
    /\/\// {
    	s/(\/\/ .*)/<span class="comment">\1<\/span>/
    	
    	b url
    }
    
    # Pre-processor
    /^#.*/ {
    	# We add a span to syntax highlight it
    	s/(#.*)/<span class="preproc">\1<\/span>/
    	
    	# We want to make links out of the headers we have written
    	s/#include \"(.*)\"/#include "<a href="\.\/\1tml" alt="\1">\1<\/a>"/
    	
    	# Process no more
    	b end
    }
    
    # We syntax highlight the standard library stuff (not sure if I want to turn this on)
    
    # s/(std::[^ (]+)/<keyword>\1<\/keyword>/g
    
    # Types
    s/(int)/<keyword>\1<\/keyword>/g
    s/(char)/<keyword>\1<\/keyword>/g
    s/(struct)/<keyword>\1<\/keyword>/g
    
    # Keywords
    s/(switch)/<keyword>\1<\/keyword>/g
    s/(case)/<keyword>\1<\/keyword>/g
    s/(default)/<keyword>\1<\/keyword>/g
    s/(new)/<keyword>\1<\/keyword>/g
    s/(delete)/<keyword>\1<\/keyword>/g
    s/(typedef)/<keyword>\1<\/keyword>/g
    s/(return)/<keyword>\1<\/keyword>/g
    s/(const)/<keyword>\1<\/keyword>/g
    s/(friend)/<keyword>\1<\/keyword>/g
    
    s/(public:)/<keyword>\1<\/keyword>/g
    s/(private:)/<keyword>\1<\/keyword>/g
    s/(protected:)/<keyword>\1<\/keyword>/g
    
    # This one is special: we require the trailing space so that we never match the class= attribute of a <span>,
    # but we keep that space outside the tag so it is not emitted twice.
    s/(class) /<keyword>\1<\/keyword> /g
    
    # Literal strings, we want to highlight them, but there is a catch
    # in C++ we are allowed to start a literal string on one line, and then continue it on the next line
    # this means we need to make sure we parse that correctly!
    
    /"/ {
    
    	# Are both the opening and closing quote on the same line? If so, branch to noloop
    	/"[^"]*"/b noloop
    	
    	# No, they are apparently not. This means we replace the quote with the correct span tag
    	s/"([^"]*)/<text>"\1/
    	
    	# Label the loop
    	: loop
    	
    	# Append the next line from the input file to the current line, move cursor forward by one
    	N
    	
    	# Did we find another quote character yet? If not, we branch to label loop
    	/"[^"]+"/!b loop
    	
    	# Ah, we found it. Replace it with the correct span tag.
    	s/("[^"]+)"/\1"<\/text>/
    	
    	b endquote
    	
    	# Label the noloop branch
    	: noloop
    	
    	s/"([^"]*)"/<text>"\1"<\/text>/g
    	
    	: endquote
    	
    	# String literals should not contain "syntax" highlighted code. So we remove all keyword tags from them
    	# Label removetags
    	: removetags
    
    	# Remove <keyword> and </keyword> from the source file
    	s/("[^<]+)<keyword>([^"]+")/\1\2/g
    	s/("[^<]+)<\/keyword>([^"]+")/\1\2/g
    
    	# Check if there are any more keyword tags left in this part of the string literal
    	# if so, we branch back to removetags. We basically loop until this condition returns false.
    	/"[^"]+<keyword>[^"]+"/b removetags
    }
    
    : url
    # Replace URL's {Text here}[URL here]
    s/\{([^\}]*)\}\[([^\[]*)\]/<a href="\2" alt="\1">\1<\/a>/g
    
    : end
    
    # Replace <keyword> and </keyword> with their span equivalent
    s/<keyword>/<span class="keyword">/g
    s/<\/keyword>/<\/span>/g
    s/<text>/<span class="text">/g
    s/<\/text>/<\/span>/g
    New and improved version!
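    The main trick in this version is the placeholder tags: keywords are first marked as `<keyword>`, the marks are stripped again inside string literals, and only at the very end do the placeholders become real spans. Here is a minimal sketch of that pipeline in Python rather than sed. The name `highlight` and the shortened keyword list are assumptions for the example, and it adds one thing the script does not do: `\b` word boundaries, so that `int` is not marked inside a word such as `print`.

```python
import html
import re

# Shortened, illustrative subset of the keyword list in the script.
KEYWORDS = ("int", "char", "struct", "return", "const", "class")

KEYWORD = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")
STRING = re.compile(r'"[^"]*"')
PLACEHOLDER = re.compile(r"</?keyword>")


def highlight(line: str) -> str:
    """Highlight one line of C++ using placeholder tags, then expand them."""
    # Escape &, < and > first so source text can never look like a tag.
    line = html.escape(line, quote=False)
    # Pass 1: mark keywords with a placeholder tag.
    line = KEYWORD.sub(r"<keyword>\1</keyword>", line)
    # Pass 2: string literals must not contain highlighted keywords,
    # so strip the placeholders inside them and mark the literal as text.
    line = STRING.sub(
        lambda m: "<text>" + PLACEHOLDER.sub("", m.group(0)) + "</text>", line
    )
    # Pass 3: only now expand placeholders into spans. Because "class="
    # appears after keyword matching is done, it can never be re-matched.
    line = re.sub(r"<(keyword|text)>", r'<span class="\1">', line)
    return re.sub(r"</(keyword|text)>", "</span>", line)


print(highlight('return "int x";'))
print(highlight("int print = 1;"))
```

    Expanding the placeholders last is what makes the special handling of `class` unnecessary in this sketch; whether the same ordering suits the sed script depends on how its branches are laid out.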
     
  7. osnnraptor

    osnnraptor OSNN One Post Wonder

    Messages:
    6
  8. X-Istence

    X-Istence * Political User

    Code:
    #!/usr/bin/sed -E -f 
    
    ###
     # Copyright (c) 2009 Bert JW Regeer <xistence@0x58.com>
     #
     # Permission to use, copy, modify, and distribute this software for any
     # purpose with or without fee is hereby granted, provided that the above
     # copyright notice and this permission notice appear in all copies.
     #
     # THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
     # WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
     # MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
     # ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
     # WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
     # ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
     # OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
     #
    ##
    
    # Remove anything HTML is going to hate us for
    s/</\&lt;/g
    s/>/\&gt;/g
    
    # Pre-processor
    /^#.*/ {
    	# We add a span to syntax highlight it
    	s/(#.*)/<span class="preproc">\1<\/span>/
    	
    	# We want to make links out of the headers we have written
    	s/#include \"(.*)\"/#include "<a href="\.\/\1tml" alt="\1">\1<\/a>"/
    	
    	# For pre-processor directives we do no other processing what so ever!
    	b
    }
    
    # We syntax highlight the standard library stuff (not sure if I want to turn this on)
    
    # s/(std::[^ (]+)/<keyword>\1<\/keyword>/g
    
    # Types
    s/(int)/<keyword>\1<\/keyword>/g
    s/(char)/<keyword>\1<\/keyword>/g
    s/(struct)/<keyword>\1<\/keyword>/g
    
    # Keywords
    s/(switch)/<keyword>\1<\/keyword>/g
    s/(case)/<keyword>\1<\/keyword>/g
    s/(default)/<keyword>\1<\/keyword>/g
    s/(new)/<keyword>\1<\/keyword>/g
    s/(delete)/<keyword>\1<\/keyword>/g
    s/(typedef)/<keyword>\1<\/keyword>/g
    s/(return)/<keyword>\1<\/keyword>/g
    s/(const)/<keyword>\1<\/keyword>/g
    s/(friend)/<keyword>\1<\/keyword>/g
    
    s/(public:)/<keyword>\1<\/keyword>/g
    s/(private:)/<keyword>\1<\/keyword>/g
    s/(protected:)/<keyword>\1<\/keyword>/g
    
    # This one is special: we require the trailing space so that we never match the class= attribute of a <span>,
    # but we keep that space outside the tag so it is not emitted twice.
    s/(class) /<keyword>\1<\/keyword> /g
    
    # Multiline comments
    # Issues:
    #
    # It is valid C/C++ to do this:
    #
    # /* this is a comment */ myclass = new myclass(); /* comment again */
    #
    # This parser uses one hell of a greedy regular expression; if you can figure out a way to make it non-greedy, you, sir, are a god.
    # Back to the issue: instead of turning that line into <comment> <code> <comment>, it becomes a single <comment>.
    # Yes, the entire line is now a comment. That is bad. So don't use multiple comments on the same line, and you will be fine!
    
    
    /\/\*/ {
    	# Are both the opening and closing comment delimiters on the same line? If so, branch to cnoloop.
    	# This uses the same test as the loop below, so short comments such as /*a*/ are not joined with the next line.
    	/\/\*.*\*\//b cnoloop
    	
    	# No, they are apparently not. The comment tags are added at cnoloop, once the whole comment has been read in.
    	# s/(\/\*.*$)/<comment>\1/
    	
    	# Label the loop
    	: cloop
    	
    	# Append the next line from the input file to the current line, move cursor forward by one
    	N
    	
    	# Did we find the closing comment delimiter yet? If not, we branch to label cloop
    	/\/\*.*\*\//!b cloop
    	
    	# Label the noloop branch
    	: cnoloop
    	
    	s/(\/\*.*\*\/)/<comment>\1<\/comment>/g
    	
    	: endcomment
    
    	: cremovetags
    	
    	s/((\/\*)[^<]+)<keyword>(.*\*\/)/\1\3/g
    	s/((\/\*)[^<]+)<\/keyword>(.*\*\/)/\1\3/g
    	
    	/\/\*[^<]+<keyword>([^\*][^\/]+)*\*\//b cremovetags
    	
    	# We don't want to process literal strings
    	
    	b end
    }
    
    /\/\/ .*/ {
    	s/(\/\/ .*)/<comment>\1<\/comment>/
    	
    	: cpremovetags
    
    	s/(\/\/ [^<]*)<keyword>(.*)$/\1\2/g
    	s/(\/\/ [^<]*)<\/keyword>(.*)$/\1\2/g
    	
    	/\/\/ [^<]*<keyword>.*$/b cpremovetags
    	
    	# We don't want to process literal strings
    	
    	b end
    }
    
    # Literal strings, we want to highlight them, but there is a catch
    # in C++ we are allowed to start a literal string on one line, and then continue it on the next line
    # this means we need to make sure we parse that correctly!
    
    /"/ {
    	# Are both the opening and closing quote on the same line? If so, branch to noloop
    	/"[^"]*"/b qnoloop
    	
    	# Label the loop
    	: qloop
    	
    	# Append the next line from the input file to the current line, move cursor forward by one
    	N
    	
    	# Did we find another quote character yet? If not, we branch to label loop
    	/"[^"]+"/!b qloop
    	
    	# Label the noloop branch
    	: qnoloop
    	
    	# String literals should not contain "syntax" highlighted code. So we remove all keyword tags from them
    	# Label removetags
    	: qremovetags
    
    	# Remove <keyword> and </keyword> from the source file
    	s/("[^<]*)<keyword>([^"]+")/\1\2/g
    	s/("[^<]*)<\/keyword>([^"]+")/\1\2/g
    
    	# Check if there are any more keyword tags left in this part of the string literal
    	# if so, we branch back to removetags. We basically loop until this condition returns false.
    	/"[^"]*<keyword>[^"]+"/b qremovetags
    	
    	s/"([^"]+)"/"<text>\1<\/text>"/g
    }
    
    : end
    
    s/<([^\/][^>]+)>/<span class="\1">/g
    s/<\/[^>]+>/<\/span>/g
    
    : url
    
    # Replace URL's {Text here}[URL here]
    s/\{([^\}]*)\}\[([^\[]*)\]/<a href="\2" alt="\1">\1<\/a>/g
    Updated to the latest version I had sitting in my Subversion repository. I also added a license so that people who would like to use it can do so without infringing on my copyright!
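    On the greedy comment regex mentioned in the script's comments: a minimal Python sketch of two ways to stop at the first `*/`. The lazy `.*?` form needs an engine with lazy quantifiers, which POSIX extended regexes do not have. The second pattern avoids lazy quantifiers altogether, so it should carry over to `sed -E` if the `(?:...)` group is written as a plain group, though I have not tried it inside this script. The name `highlight_comments` is made up for the example.

```python
import re

# Lazy form: .*? stops at the first */ (re.S lets . cross newlines).
LAZY = re.compile(r"/\*.*?\*/", re.S)

# Form without lazy quantifiers: after /*, consume either a non-star
# character or a run of stars that is not followed by /, and then the
# closing run of stars plus /. The body can never swallow a */.
PORTABLE = re.compile(r"/\*(?:[^*]|\*+[^*/])*\*+/")


def highlight_comments(code: str, pattern: re.Pattern = LAZY) -> str:
    """Wrap every /* ... */ comment in a span with class "comment"."""
    return pattern.sub(
        lambda m: '<span class="comment">' + m.group(0) + "</span>", code
    )


# The problem line quoted in the script's comments.
src = "/* this is a comment */ myclass = new myclass(); /* comment again */"
print(highlight_comments(src))
print(highlight_comments(src, PORTABLE))
```

    Both patterns leave the code between the two comments untouched. Neither knows about string literals, so a `/*` inside a quoted string would still be treated as the start of a comment.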