I've been working with government-generated HTML this week. By the looks of it, I assume the HTML I am working with is generated and exported directly from Satan's Word Processor. Here is a small sample of what some of this HTML looks like.
<table style=\ "\"font-family:Times New Roman;font-size:1em;width:340px;border:1px solid black;\ "\">
I have no clue what is going on in the sample. It looks like the double quotes are being unsuccessfully escaped. In addition, why does the government assume I want any of their styling?
Clean this up
The HTML I am working with comes with attributes that can hurt the styling of my page. My goal is to remove all attributes and be left with the plain old HTML tags.
First, we need to install the HtmlAgilityPack NuGet package.
> Install-Package HtmlAgilityPack
Once installed, we load the HTML into an HtmlDocument and strip all attributes from our HTML tags.
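using System.IO;
using System.Text;
using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html); // html: the markup string to clean, defined elsewhere
var sb = new StringBuilder();
var tags = doc.DocumentNode.SelectNodes("//*"); // null when nothing matches
if (tags != null)
    foreach (var tag in tags)
        tag.Attributes.RemoveAll(); // strip every attribute from this tag
using (var textwriter = new StringWriter(sb))
    doc.Save(textwriter); // sb now holds the cleaned markup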
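If you want to run the whole thing end to end, here is a minimal sketch of the same idea as a standalone console program. The StripAttributes helper name and the sample markup are my own placeholders rather than anything from the government data, and it assumes the HtmlAgilityPack package is referenced by the project.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

public static class Program
{
    // Hypothetical helper: returns the markup with every attribute removed.
    public static string StripAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        var tags = doc.DocumentNode.SelectNodes("//*");
        if (tags != null)
        {
            foreach (var tag in tags)
            {
                tag.Attributes.RemoveAll();
            }
        }

        // Write the modified document into a StringBuilder.
        var sb = new StringBuilder();
        using (var textwriter = new StringWriter(sb))
        {
            doc.Save(textwriter);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Made-up sample input, loosely modeled on the table shown earlier.
        var html = "<table style=\"width:340px;border:1px solid black;\">"
                 + "<tr><td class=\"cell\">Hello</td></tr></table>";

        Console.WriteLine(StripAttributes(html));
    }
}
```

The printed markup should be the same table with the style and class attributes gone and the tags themselves left in place.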
Note, we can strip attributes selectively if we call Remove on the Attributes collection using the name of the attribute. If we only wanted to remove the style attributes, we would use the following code.
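A minimal sketch of the style-only version, written as its own standalone program. The RemoveStyleAttributes name and the sample markup are placeholders of mine; the only library calls it relies on are LoadHtml, SelectNodes, Attributes.Remove and OuterHtml. I narrowed the XPath to elements that actually carry a style attribute, but the "//*" query from before should work too.

```csharp
using System;
using HtmlAgilityPack;

public static class StyleStripper
{
    // Hypothetical helper: removes only the style attribute from each tag.
    public static string RemoveStyleAttributes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select only elements that have a style attribute.
        // SelectNodes returns null when nothing matches.
        var tags = doc.DocumentNode.SelectNodes("//*[@style]");
        if (tags != null)
        {
            foreach (var tag in tags)
            {
                // Remove by name; other attributes are left alone.
                tag.Attributes.Remove("style");
            }
        }

        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Made-up sample input for illustration.
        var html = "<table id=\"report\" style=\"font-size:1em;width:340px;\">"
                 + "<tr><td style=\"color:red;\">Hello</td></tr></table>";

        Console.WriteLine(RemoveStyleAttributes(html));
    }
}
```

With this input, the id attribute on the table should survive while both style attributes are dropped.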
You can try out HtmlAgilityPack using this DotNetFiddle.