HTML Manipulation – MSHTML vs HTML Agility Pack

Best HTML Parser

When it comes to manipulating HTML page content, there are two main HTML parsers that work well with the .NET Framework.

  • MSHTML by Microsoft
  • HTML Agility Pack

Of the two, in my experience, HTML Agility Pack easily outperforms MSHTML. The main reasons are:

  1. Proper documentation.
  2. Customer support.
  3. Support for large HTML content.
  4. More features/methods.

MSHTML is a legacy Microsoft library. To use it, add it to your project from your computer’s assembly location (Add Reference -> Assemblies, type “HTML” in the search box, select Microsoft.mshtml, and add it to your project). Alternatively, you can try the NuGet package version.

To start manipulating HTML content, you first have to load it.

Load the HTML content

  • MSHTML

If you pass your HTML content using a string parameter called “content”,

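// Requires a reference to Microsoft.mshtml (using mshtml;)
var htmlDoc = new HTMLDocument();
var ihtmlDoc = (IHTMLDocument2)htmlDoc;
ihtmlDoc.open();
ihtmlDoc.write(content);
ihtmlDoc.close(); // finish the write so the DOM is complete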

  • Agility Pack

HtmlWeb web = new HtmlWeb();
var uri = new Uri(FilePath);
var htmlDoc = web.Load(uri, "POST");
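The MSHTML snippet starts from a string, while the Agility Pack snippet starts from a file. For a like-for-like comparison, here is a minimal sketch that parses the same kind of “content” string with HtmlDocument.LoadHtml. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (HtmlLoading, FromString) are made up for illustration.

```csharp
using HtmlAgilityPack;

public static class HtmlLoading
{
    // Parses markup that is already in memory; no file or network access is involved.
    // "HtmlLoading" and "FromString" are illustrative names, not part of the library.
    public static HtmlDocument FromString(string content)
    {
        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(content);
        return htmlDoc;
    }
}
```

Usage would look like: var htmlDoc = HtmlLoading.FromString(content);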

If you need to manipulate a large page, we recommend HTML Agility Pack. Our .NET development team ran into limitations with MSHTML when handling large HTML content in one of our SharePoint development (with Azure) projects, and the team struggled to pinpoint the issue. Once we deployed the solution to one of our Azure App Services, the MSHTML version would break at the “ihtmlDoc.write(content)” line without producing an error or a stack trace. The cause could be either a memory leak or an unhandled exception inside that library.

The following are some common operations you can use to manipulate an HTML page.

Deleting a Node

  • MSHTML

var childNode = htmlDoc.getElementById("u_0_1i");
if (childNode != null)
{
    // parentNode and removeChild are defined on IHTMLDOMNode, so cast once and reuse it
    var domNode = (IHTMLDOMNode)childNode;
    domNode.parentNode.removeChild(domNode);
}

  • HTML Agility Pack

var childNode = htmlDoc.GetElementbyId("u_0_1i");
if (childNode != null)
{
    childNode.ParentNode.RemoveChild(childNode, false);
}

In both cases, we get the child element and then use its parent element to remove it.
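The second argument to RemoveChild in the Agility Pack snippet is the keepGrandChildren flag. The sketch below wraps the same call so the flag is explicit: with true, the removed element's own children are meant to stay in the document in its place. The helper name RemoveById is hypothetical, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using HtmlAgilityPack;

public static class NodeRemoval
{
    // Removes the element with the given id and returns true,
    // or returns false when no such element exists.
    // keepChildren is passed through as RemoveChild's keepGrandChildren flag.
    // "RemoveById" is an illustrative helper, not a library method.
    public static bool RemoveById(HtmlDocument htmlDoc, string id, bool keepChildren)
    {
        var node = htmlDoc.GetElementbyId(id);
        if (node == null || node.ParentNode == null)
        {
            return false;
        }
        node.ParentNode.RemoveChild(node, keepChildren);
        return true;
    }
}
```

For example, NodeRemoval.RemoveById(htmlDoc, "u_0_1i", false) mirrors the snippet above, while passing true would unwrap the element instead of deleting its contents.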

Set CSS of an Element

  • MSHTML

var postContentContainer = htmlDoc.getElementById("content_container");
if (postContentContainer != null)
    postContentContainer.style.cssText = "width: 100%!important;";

  • HTML Agility Pack

var postContentContainer = htmlDoc.GetElementbyId("content_container");
if (postContentContainer != null)
{
    postContentContainer.GetAttributeValue("style", null);
    postContentContainer.SetAttributeValue("style", "width: 100%!important;");
}

Note: We had no luck with “postContentContainer.Attributes.Append(“style”);” when changing the style of an element with HTML Agility Pack, since that call only adds an empty attribute. What worked for us was “GetAttributeValue” followed by “SetAttributeValue” to overwrite the styles.
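One practical reason to read the attribute before writing it is to keep any inline styles the element already has. The following is a minimal sketch of that idea, assuming the HtmlAgilityPack package is referenced; StyleHelper and AppendStyle are hypothetical names, and the string handling is deliberately simple (it does not parse or de-duplicate CSS properties).

```csharp
using HtmlAgilityPack;

public static class StyleHelper
{
    // Appends one CSS declaration to the element's inline style,
    // keeping whatever declarations are already there.
    // "AppendStyle" is an illustrative helper, not a library method.
    public static void AppendStyle(HtmlNode element, string declaration)
    {
        // GetAttributeValue returns the fallback ("") when there is no style attribute.
        var existing = element.GetAttributeValue("style", "").Trim();
        if (existing.Length > 0 && !existing.EndsWith(";"))
        {
            existing += ";";
        }
        var combined = existing.Length > 0 ? existing + " " + declaration : declaration;
        // SetAttributeValue creates the attribute or overwrites its current value.
        element.SetAttributeValue("style", combined);
    }
}
```

Calling StyleHelper.AppendStyle(postContentContainer, "width: 100%!important;") would then extend the existing style instead of replacing it.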

Set Attributes of an Element

  • MSHTML

var nodeSample = htmlDoc.getElementById("content_container");
if (nodeSample != null)
    nodeSample.setAttribute("width", "100%");

  • HTML Agility Pack

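var globalContainer = htmlDoc.GetElementbyId("globalContainer");
if (globalContainer != null)
{
    // SetAttributeValue adds the attribute when it is missing, so no Append is needed.
    // "!important" is CSS syntax and does not belong in a plain attribute value.
    globalContainer.SetAttributeValue("width", "100%");
}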

Change CSS Class Name

  • MSHTML

var nodeSample = htmlDoc.getElementById("content_container");
nodeSample.className = "_2pie _14i5 _1qkx";

  • HTML Agility Pack

var videosElement = htmlDoc.GetElementbyId("videos");
if (videosElement != null)
{
    videosElement.ReplaceClass("newClass", "oldClass");
}
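Besides ReplaceClass, HtmlNode also offers HasClass, AddClass and RemoveClass, which help when you are not sure whether the old class is present. Here is a small sketch under that assumption; ClassHelper and SwapClass are made-up names, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using HtmlAgilityPack;

public static class ClassHelper
{
    // Swaps oldClass for newClass when oldClass is present;
    // otherwise it just makes sure newClass is on the element.
    // "SwapClass" is an illustrative helper, not a library method.
    public static void SwapClass(HtmlNode element, string oldClass, string newClass)
    {
        if (element.HasClass(oldClass))
        {
            element.RemoveClass(oldClass);
        }
        if (!element.HasClass(newClass))
        {
            element.AddClass(newClass);
        }
    }
}
```

To mirror the MSHTML snippet and replace the whole class list in one go, videosElement.SetAttributeValue("class", "_2pie _14i5 _1qkx") should do the same job.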

Overwrite CSS or Add New Classes

  • MSHTML

mshtml.IHTMLStyleSheet css = (mshtml.IHTMLStyleSheet)ihtmlDoc.createStyleSheet("", 0);
css.cssText = ".uiScaledImageContainer { width: 100% !important; height: auto!important; } ._4-eo { width: 100% !important; }";

  • HTML Agility Pack

var cssString = ".uiScaledImageContainer { /*image container*/ width: 100% !important; height: auto!important; } ._4-eo { /*image a tag*/ width: 100% !important; }";
var styles = htmlDoc.CreateElement("style");
var styleText = htmlDoc.CreateTextNode(cssString);
styles.AppendChild(styleText);
htmlDoc.DocumentNode.AppendChild(styles);

Select Particular Nodes

  • HTML Agility Pack

var links = htmlDoc.DocumentNode.SelectNodes("//a");

Overwrite an Attribute

  • HTML Agility Pack

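var links = htmlDoc.DocumentNode.SelectNodes("//a");
if (links != null) // SelectNodes returns null when nothing matches
{
    foreach (var node in links)
    {
        // Remove any existing target attribute, then add it back with the new value
        if (node.Attributes.Contains("target"))
            node.Attributes.Remove("target");
        node.Attributes.Append("target");
        node.SetAttributeValue("target", "_blank");
    }
}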
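A common stumbling block with SelectNodes is that it returns null, not an empty collection, when the XPath matches nothing, so a foreach over the result can throw. The sketch below wraps the call so callers can always iterate safely. XPathHelper and SelectNodesOrEmpty are hypothetical names, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathHelper
{
    // Returns the matching nodes, or an empty sequence when nothing matches.
    // "SelectNodesOrEmpty" is an illustrative helper, not a library method.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(HtmlDocument htmlDoc, string xpath)
    {
        IEnumerable<HtmlNode> nodes = htmlDoc.DocumentNode.SelectNodes(xpath);
        return nodes ?? Enumerable.Empty<HtmlNode>();
    }
}
```

With this in place, foreach (var node in XPathHelper.SelectNodesOrEmpty(htmlDoc, "//a")) needs no separate null check.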

Select Nodes with XPath

  • HTML Agility Pack

var nodes = htmlDoc.DocumentNode.SelectNodes("//div[contains(@class, '_5va1 _427x')]");

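This returns all the DIV elements whose class attribute contains the text “_5va1 _427x”. Note that this is an XPath substring match, not a regular expression.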
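Because contains() compares substrings, it also matches longer class names that merely include the text, and it misses elements where the two classes appear in a different order. A common XPath 1.0 workaround is to pad the class list with spaces and look for a single whole token. The sketch below shows that pattern; ClassXPath and its methods are made-up names, the HtmlAgilityPack package is assumed to be referenced, and the class name is assumed not to contain a single quote.

```csharp
using HtmlAgilityPack;

public static class ClassXPath
{
    // Builds an XPath that matches <div> elements whose class list contains
    // className as a whole token, not merely as a substring.
    // Assumes className contains no single quote. Illustrative helper only.
    public static string DivWithClass(string className)
    {
        return "//div[contains(concat(' ', normalize-space(@class), ' '), ' "
               + className + " ')]";
    }

    // Counts the matching <div> elements; SelectNodes returns null on no match.
    public static int CountDivsWithClass(HtmlDocument htmlDoc, string className)
    {
        var nodes = htmlDoc.DocumentNode.SelectNodes(DivWithClass(className));
        return nodes == null ? 0 : nodes.Count;
    }
}
```

To require several classes regardless of order, the same predicate can be repeated once per class name and joined with "and".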

Save the document at the end to see the changes.

  • HTML Agility Pack

// Path.Combine avoids hand-built separators (requires using System.IO;)
using (StreamWriter sw1 = new StreamWriter(Path.Combine(filePath, "test.html")))
{
    htmlDoc.Save(sw1);
}
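If the modified markup is needed in memory instead of on disk (for example, to return it from a web request), the same Save method can write to a StringWriter. This is a minimal sketch assuming the HtmlAgilityPack package is referenced; HtmlOutput and ToHtmlString are hypothetical names.

```csharp
using System.IO;
using HtmlAgilityPack;

public static class HtmlOutput
{
    // Serializes the (possibly modified) document to a string.
    // "ToHtmlString" is an illustrative helper, not a library method.
    public static string ToHtmlString(HtmlDocument htmlDoc)
    {
        using (var writer = new StringWriter())
        {
            htmlDoc.Save(writer);
            return writer.ToString();
        }
    }
}
```

Usage would look like: string html = HtmlOutput.ToHtmlString(htmlDoc);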

Summary

Having worked extensively with both libraries, I recommend picking HTML Agility Pack over the MSHTML class library.

If you want more support, please contact us.
