Anyone working with XML knows that the following five characters are reserved and need to be encoded/escaped any time they occur as content in XML (either as attribute values or as element values):

  • < (which needs to be escaped as &lt; in content)
  • > (which needs to be escaped as &gt; in content)
  • & (which needs to be escaped as &amp; in content)
  • ' (which needs to be escaped as &apos; in content)
  • " (which needs to be escaped as &quot; in content)

Of course, you can use the numeric encoding too (e.g. &#60; for <), but the point is that the special characters need to be encoded when they appear as content in XML.
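As a quick illustration of what correct escaping looks like, here is a minimal sketch that uses the framework's SecurityElement.Escape helper. The EscapeDemo class and its Escape wrapper are names made up for this example; they are not part of the code discussed in this post.

```csharp
using System;
using System.Security;

public static class EscapeDemo
{
    // Thin wrapper (hypothetical name) around the framework helper that
    // escapes the five reserved XML characters.
    public static string Escape(string raw)
    {
        return SecurityElement.Escape(raw);
    }

    public static void Main()
    {
        // Prints: Tom &amp; Jerry &lt;&quot;cat&quot; vs &apos;mouse&apos;&gt;
        Console.WriteLine(Escape("Tom & Jerry <\"cat\" vs 'mouse'>"));
    }
}
```

This is what a producer should be doing before writing values into XML; the rest of this post is about what to do when it has not.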

I was recently surprised to see that a big vendor in the travel domain was supplying us XML in which their devs had encoded the XML special characters in some parts of the content while leaving them unencoded in others, e.g. they were sending us tags like:

<url>https://www.theirdomain.com?param1=value1&param2=value2</url>

Notice that the & in the URL is unencoded, which makes the XML malformed and caused our C# code to fail at the very first step of loading the XML into a DOM document. I talked to the client, and we were told there was next to zero chance the vendor could fix this on their end quickly; we needed to handle the malformed XML on our end because the failing integration was causing production issues.
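To make the failure concrete, here is a small sketch of the kind of load that breaks. MalformedXmlDemo and IsWellFormed are hypothetical names for this example only; our real integration code looked different.

```csharp
using System;
using System.Xml;

public static class MalformedXmlDemo
{
    // Returns true when the string loads as well-formed XML, false when
    // the parser rejects it with an XmlException.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        var bad = "<url>https://www.theirdomain.com?param1=value1&param2=value2</url>";
        var good = "<url>https://www.theirdomain.com?param1=value1&amp;param2=value2</url>";

        Console.WriteLine(IsWellFormed(bad));   // False: the bare & starts an entity that never ends
        Console.WriteLine(IsWellFormed(good));  // True
    }
}
```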

Writing our own parser to correct the malformed content while the XML stream was being read would have taken a considerable amount of time, so there was no point going down that road. Instead, I thought we should read the entire XML stream into a string and then use regexes to try and correct the malformed XML.

After a few iterations, I was able to come up with a method (it needs using System.Text.RegularExpressions;) which escapes the unescaped special characters in element values:

public static string fixUnescapedCharacters (string content)
{
	var regexStr = @"<([a-zA-Z0-9_]+?)>((.)+?)</\1>";

	var tagRegexStr = "(<([a-zA-Z0-9_]+)>)|(</([a-zA-Z0-9_]+)>)";
	var unescapedCharacterRegexStr = @"(&amp;|&lt;|&gt;|&quot;|&apos;|&|<|>|""|')";

	var regex = new Regex(regexStr);
	var tagRegex = new Regex(tagRegexStr);
	var unescapedCharacterRegex = new Regex(unescapedCharacterRegexStr);

	var output = regex.Replace(content, delegate (Match match)
	{
		var elementContent = match.Groups[2].Value;
		if (tagRegex.IsMatch(elementContent))
		{
			return (match.ToString());
		}
		if (!unescapedCharacterRegex.IsMatch(elementContent))
		{
			return (match.ToString());
		}

		elementContent = unescapedCharacterRegex.Replace(elementContent, delegate (Match match2)
		{
			var unescapedContent = match2.ToString();
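			unescapedContent = unescapedContent
				// First unescape the token if it is already escaped...
				.Replace("&amp;", "&").Replace("&lt;", "<").Replace("&gt;", ">").Replace("&quot;", "\"").Replace("&apos;", "'")
				// ...then escape it, starting with & so the new entities are not escaped again.
				.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;").Replace("'", "&apos;");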

			return (unescapedContent);
		});

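		// Rebuild the element by concatenation; match.Result would treat any $ in the content as a substitution token.
		return ("<" + match.Groups[1].Value + ">" + elementContent + "</" + match.Groups[1].Value + ">");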
	});

	return (output);
}

Basically the above method:

  1. Uses a primary regex (regexStr, line 3 of the method) to match an opening tag, its content and the matching closing tag. Since . does not match a newline by default, only elements whose opening and closing tags sit on the same line are matched.
     
  2. It then uses a Regex.Replace call with a callback to perform a custom replacement on each Match of the regex from step 1.
     
  3. In the callback, we do not process any tags which have nested tags in them, e.g.
    <outer><inner></inner></outer>

    We only need to process the content of the innermost tags (at least that was how the XML we were parsing was structured; if your XML has elements with both content and nested tags, you will need to tweak the logic a bit).
     
  4. We then use another regex (unescapedCharacterRegexStr, line 6) that matches either an escaped or an unescaped special character. The important thing to note is that regex alternatives are tried from left to right, and the escaped special characters occur before the unescaped ones in this regex, so an already escaped &amp; is matched as a whole rather than as a bare &.
     
  5. For each match of this regex, we use a sequence of String.Replace calls (line 27) to first unescape whatever is already escaped and then escape everything that needs escaping. The reason for unescaping first is that the content we were receiving contained unescaped as well as escaped special characters in the same XML element value, and we just hacked our way around it with this sequence of String.Replace calls to avoid more complex regex patterns.
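Steps 4 and 5 can be tried in isolation with a small sketch like the one below. NormalizeDemo, Token, Unescape, Escape and Normalize are names I made up for this example, and it works on a bare string rather than on whole elements, so treat it as an illustration of the idea and not as a replacement for the method above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NormalizeDemo
{
    // Escaped entities are listed before the bare characters, so "&amp;" is
    // matched as one token instead of a bare "&" followed by "amp;".
    private static readonly Regex Token =
        new Regex(@"&amp;|&lt;|&gt;|&quot;|&apos;|&|<|>|""|'");

    private static string Unescape(string s)
    {
        return s.Replace("&lt;", "<").Replace("&gt;", ">")
                .Replace("&quot;", "\"").Replace("&apos;", "'")
                .Replace("&amp;", "&");
    }

    private static string Escape(string s)
    {
        // & goes first so the entities produced below are not escaped again.
        return s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;")
                .Replace("\"", "&quot;").Replace("'", "&apos;");
    }

    // Unescape each token first, then escape it, so already-escaped
    // entities come out unchanged and bare characters come out escaped.
    public static string Normalize(string content)
    {
        return Token.Replace(content, m => Escape(Unescape(m.Value)));
    }

    public static void Main()
    {
        // Prints: Tom &amp; Jerry &amp; &lt;friends&gt;
        Console.WriteLine(Normalize("Tom &amp; Jerry & <friends>"));
    }
}
```

Running Normalize on its own output gives the same string back, which is the property we wanted for content that mixes escaped and unescaped characters.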

And finally we had correctly escaped XML that we could then process with either the XDocument or the XmlDocument class.
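For completeness, here is a minimal sketch of that last step with XDocument. The links wrapper element and the ParseDemo and ReadUrl names are assumptions made for this example; the vendor's real document structure was different.

```csharp
using System;
using System.Xml.Linq;

public static class ParseDemo
{
    // Parses corrected XML and returns the text of the first <url> child
    // of the root element.
    public static string ReadUrl(string xml)
    {
        var doc = XDocument.Parse(xml);
        return doc.Root.Element("url").Value;
    }

    public static void Main()
    {
        var xml = "<links><url>https://www.theirdomain.com?param1=value1&amp;param2=value2</url></links>";

        // The parser hands back the unescaped value:
        // https://www.theirdomain.com?param1=value1&param2=value2
        Console.WriteLine(ReadUrl(xml));
    }
}
```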

Please note that the approach in this blog post:

  • only corrects unescaped characters in element values (not in attribute values). If you need to escape them in attribute values too, you will probably need to tweak the regexes above to capture the attribute values.
  • assumes an element contains either a value only or nested tags only. In our use-case, an element never contained both text and nested tags. If yours does, you will need to tweak the code a bit again.
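If you do need the attribute case, one possible starting point is sketched below. It is not something we ran against the vendor feed, and AttributeFixDemo, Attribute, BareAmpersand and FixAttributeValues are hypothetical names. It assumes double-quoted attribute values, and it cannot repair an unescaped quote inside a value because the quote is what delimits the value. It will also touch any element text that happens to look like name="value", so test it against your own data first.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeFixDemo
{
    // Assumed shape: name="value" pairs with double-quoted values.
    private static readonly Regex Attribute =
        new Regex(@"([A-Za-z_][\w:.-]*)=""([^""]*)""");

    // A bare & is one that does not start a named or numeric entity.
    private static readonly Regex BareAmpersand =
        new Regex(@"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)");

    public static string FixAttributeValues(string content)
    {
        return Attribute.Replace(content, m =>
        {
            // Escape bare ampersands first, then angle brackets.
            var value = BareAmpersand.Replace(m.Groups[2].Value, "&amp;")
                .Replace("<", "&lt;")
                .Replace(">", "&gt;");
            return m.Groups[1].Value + "=\"" + value + "\"";
        });
    }

    public static void Main()
    {
        // Prints: <link href="a.aspx?x=1&amp;y=2&amp;z=3" />
        Console.WriteLine(FixAttributeValues("<link href=\"a.aspx?x=1&y=2&amp;z=3\" />"));
    }
}
```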

Hope this helps someone. Happy coding!