How to clean garbage out of HTML code. Online converter of Excel, Word, and TXT files to pure HTML code without unnecessary CSS styles.

Hello!

When I was writing my own WYSIWYG editor, I ran into a problem with text copied from Word. In fact there are three problems:

  • Word inserts a lot of junk HTML that has to be cleaned out.
  • For some reason, Word represents lists with paragraphs instead of UL and LI tags.
  • It has to be determined somehow that the pasted text actually came from Word.

To solve these problems I wrote a jQuery plugin; its complete source is available at the end of the article. Usage example:

$('#Editor').msword_html_filter();

The plugin hooks the keyup event and checks whether the markup inside the editor was pasted from Word; if so, the cleanup function runs. Everything that can be removed is stripped from the resulting HTML: non-breaking spaces, style and align attributes, span tags, all Mso classes, and empty paragraphs.

Implementation details follow.

Most of the regular expressions used were borrowed from TinyMCE.

How to determine whether a string contains HTML pasted from Word:

if (/class="?Mso|style="[^"]*\bmso-|style='[^'']*\bmso-|w:WordDocument/i.test(content)) { ... }
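The check can be tried outside the editor on plain strings. The following is a minimal sketch, assuming the pattern quoted above; the constant `WORD_MARKERS` and the function `looksLikeWordHtml` are hypothetical names introduced for illustration and are not part of the plugin.

```javascript
// Heuristic: Word markup carries Mso* classes, mso-* style declarations
// or the w:WordDocument XML island. The regex has no "g" flag, so
// calling test() repeatedly is stateless.
const WORD_MARKERS = /class="?Mso|style="[^"]*\bmso-|style='[^']*\bmso-|w:WordDocument/i;

// Hypothetical helper: true if the fragment looks like it came from Word.
function looksLikeWordHtml(html) {
  return WORD_MARKERS.test(html);
}

console.log(looksLikeWordHtml('<p class="MsoNormal">Hello</p>'));                    // true
console.log(looksLikeWordHtml('<p style="margin:0;mso-list:l0 level1 lfo1">x</p>')); // true
console.log(looksLikeWordHtml('<p class="intro">Hello</p>'));                        // false
```

The check is only a heuristic: markup that happens to use a class beginning with "Mso" would also be treated as Word output.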

Code cleanup function (the editor's jQuery object is passed to the function):

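function word_filter(editor) {
    var content = editor.html();

    // Word comments like conditional comments etc
    content = content.replace(/<!--[\s\S]+?-->/gi, "");

    // Remove comments, scripts (e.g., msoShowComment), XML tag, VML content,
    // MS Office namespaced tags, and a few other tags
    content = content.replace(/<(!|script[^>]*>.*?<\/script(?=[>\s])|\/?(\?xml(:\w+)?|img|meta|link|style|\w:\w+)(?=[\s\/>]))[^>]*>/gi, "");

    // Convert <s> into <strike> for line-through
    content = content.replace(/<(\/?)s>/gi, "<$1strike>");

    // Replace nbsp entities with plain spaces since they are easier to handle
    content = content.replace(/&nbsp;/gi, " ");

    // Convert <span style="mso-spacerun:yes">___</span> to string of alternating
    // breaking/non-breaking spaces of same length
    content = content.replace(/<span\s+style\s*=\s*"\s*mso-spacerun\s*:\s*yes\s*;?\s*"\s*>([\s\u00a0]*)<\/span>/gi, function (str, spaces) {
        return (spaces.length > 0) ? spaces.replace(/./, " ").slice(Math.floor(spaces.length / 2)).split("").join("\u00a0") : "";
    });

    editor.html(content);

    // Parse out list indent level for lists
    $("p", editor).each(function () {
        var str = $(this).attr("style");
        var matches = /mso-list:\w+ \w+([0-9]+)/.exec(str);
        if (matches) {
            $(this).data("_listLevel", parseInt(matches[1], 10));
        }
    });

    // Parse lists
    var last_level = 0;
    var pnt = null;
    $("p", editor).each(function () {
        var cur_level = $(this).data("_listLevel");
        if (cur_level != undefined) {
            var txt = $(this).text();
            var list_tag = "<ul></ul>";
            if (/^\s*\w+\./.test(txt)) {
                var matches = /([0-9])\./.exec(txt);
                if (matches) {
                    var start = parseInt(matches[1], 10);
                    list_tag = start > 1 ? '<ol start="' + start + '"></ol>' : "<ol></ol>";
                } else {
                    list_tag = "<ol></ol>";
                }
            }
            if (cur_level > last_level) {
                if (last_level == 0) {
                    $(this).before(list_tag);
                    pnt = $(this).prev();
                } else {
                    pnt = $(list_tag).appendTo(pnt);
                }
            }
            // The next two steps are garbled in the published listing and are reconstructed
            if (cur_level < last_level) {
                // Level decreased: climb back up to the enclosing list
                for (var i = 0; i < last_level - cur_level; i++) {
                    pnt = pnt.parent();
                }
            }
            // Drop the leading span that holds Word's bullet or number marker
            $("span:first", this).remove();
            pnt.append("<li>" + $(this).html() + "</li>");
            $(this).remove();
            last_level = cur_level;
        } else {
            last_level = 0;
        }
    });

    $("[style]", editor).removeAttr("style");
    $("[align]", editor).removeAttr("align");
    $("span", editor).replaceWith(function () { return $(this).contents(); });
    $("span:empty", editor).remove();
    $("[class^='Mso']", editor).removeAttr("class");
    $("p:empty", editor).remove();
}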
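The list handling in the cleanup function boils down to two steps: read the indent level out of each paragraph's mso-list style, then turn the flat run of paragraphs into nested lists. Here is a DOM-free sketch of that idea. `listLevel` and `buildNestedList` are hypothetical helper names, the `{ level, content }` item shape is an assumption, and unlike the plugin the sketch places each sub-list inside the preceding LI.

```javascript
// Extract the indent level from a Word style string such as
// "mso-list:l0 level2 lfo1". Returns 0 when the paragraph is not a list item.
function listLevel(style) {
  const matches = /mso-list:\w+ \w+([0-9]+)/.exec(style || "");
  return matches ? parseInt(matches[1], 10) : 0;
}

// Build nested list markup from a flat array of { level, content } items.
// `content` is assumed to be HTML that is already safe to embed.
function buildNestedList(items, tag = "ul") {
  let html = "";
  let depth = 0;
  for (const { level, content } of items) {
    if (level > depth) {
      // Going deeper: open one list per missing level
      while (depth < level) {
        html += `<${tag}>`;
        depth++;
        if (depth < level) html += "<li>"; // wrapper item when a level is skipped
      }
    } else {
      html += "</li>"; // close the previous item
      while (depth > level) {
        html += `</${tag}></li>`; // close the sub-list and its parent item
        depth--;
      }
    }
    html += `<li>${content}`;
  }
  while (depth > 0) {
    html += `</li></${tag}>`;
    depth--;
  }
  return html;
}

console.log(listLevel("mso-list:l0 level2 lfo1")); // 2
console.log(buildNestedList([
  { level: 1, content: "a" },
  { level: 2, content: "b" },
  { level: 1, content: "c" },
]));
// <ul><li>a<ul><li>b</li></ul></li><li>c</li></ul>
```

Choosing between UL and OL per item, as the plugin does by looking for a leading "1." in the text, is left out to keep the sketch short.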

The full source code of the plugin is below; save it to a file named jquery.msword_html_filter.js.


(function ($) {
    $.fn.msword_html_filter = function (options) {
        var settings = $.extend({}, options);

        function word_filter(editor) {
            var content = editor.html();

            // Word comments like conditional comments etc
            content = content.replace(/<!--[\s\S]+?-->/gi, "");

            // Remove comments, scripts (e.g., msoShowComment), XML tag, VML content,
            // MS Office namespaced tags, and a few other tags
            content = content.replace(/<(!|script[^>]*>.*?<\/script(?=[>\s])|\/?(\?xml(:\w+)?|img|meta|link|style|\w:\w+)(?=[\s\/>]))[^>]*>/gi, "");

            // Convert <s> into <strike> for line-through
            content = content.replace(/<(\/?)s>/gi, "<$1strike>");

            // Replace nbsp entities with chars since they are easier to handle
            // content = content.replace(/&nbsp;/gi, "\u00a0");
            content = content.replace(/&nbsp;/gi, " ");

            // Convert <span style="mso-spacerun:yes">___</span> to string of alternating
            // breaking/non-breaking spaces of same length
            content = content.replace(/<span\s+style\s*=\s*"\s*mso-spacerun\s*:\s*yes\s*;?\s*"\s*>([\s\u00a0]*)<\/span>/gi, function (str, spaces) {
                return (spaces.length > 0) ? spaces.replace(/./, " ").slice(Math.floor(spaces.length / 2)).split("").join("\u00a0") : "";
            });

            editor.html(content);

            // Parse out list indent level for lists
            $("p", editor).each(function () {
                var str = $(this).attr("style");
                var matches = /mso-list:\w+ \w+([0-9]+)/.exec(str);
                if (matches) {
                    $(this).data("_listLevel", parseInt(matches[1], 10));
                }
            });

            // Parse lists
            var last_level = 0;
            var pnt = null;
            $("p", editor).each(function () {
                var cur_level = $(this).data("_listLevel");
                if (cur_level != undefined) {
                    var txt = $(this).text();
                    var list_tag = "<ul></ul>";
                    if (/^\s*\w+\./.test(txt)) {
                        var matches = /([0-9])\./.exec(txt);
                        if (matches) {
                            var start = parseInt(matches[1], 10);
                            list_tag = start > 1 ? '<ol start="' + start + '"></ol>' : "<ol></ol>";
                        } else {
                            list_tag = "<ol></ol>";
                        }
                    }
                    if (cur_level > last_level) {
                        if (last_level == 0) {
                            $(this).before(list_tag);
                            pnt = $(this).prev();
                        } else {
                            pnt = $(list_tag).appendTo(pnt);
                        }
                    }
                    // The next two steps are garbled in the published listing and are reconstructed
                    if (cur_level < last_level) {
                        // Level decreased: climb back up to the enclosing list
                        for (var i = 0; i < last_level - cur_level; i++) {
                            pnt = pnt.parent();
                        }
                    }
                    // Drop the leading span that holds Word's bullet or number marker
                    $("span:first", this).remove();
                    pnt.append("<li>" + $(this).html() + "</li>");
                    $(this).remove();
                    last_level = cur_level;
                } else {
                    last_level = 0;
                }
            });

            $("[style]", editor).removeAttr("style");
            $("[align]", editor).removeAttr("align");
            $("span", editor).replaceWith(function () { return $(this).contents(); });
            $("span:empty", editor).remove();
            $("[class^='Mso']", editor).removeAttr("class");
            $("p:empty", editor).remove();
        }

        return this.each(function () {
            $(this).on("keyup", function () {
                var content = $(this).html();
                if (/class="?Mso|style="[^"]*\bmso-|style='[^'']*\bmso-|w:WordDocument/i.test(content)) {
                    word_filter($(this));
                }
            });
        });
    };
})(jQuery);


The plugin has only been tested in the latest Firefox.

Cleaner is a service for cleaning out the "garbage" that remains in the tags of a document after it has been saved as a web page from a program such as MS Word.

A long time ago I wrote a similar plugin, but it was thrown together in a hurry; the mechanism has now been completely rewritten.

The code is cleaned by iterating over the entered string and building a new, "clean" string from it. The plugin strips the contents of the tags themselves (apart from the attributes listed further down), and some tags are removed entirely. Unpaired tags are closed with a slash (/). Empty tags are removed: an opening tag followed directly by its closing tag is dropped, since it contains nothing.

How does the HTML cleaner work?

There are two ways:

  1. In MS Word, select the data you want to clean up (press Ctrl + A to select everything) and copy it. Paste the copied text into the field below (the "Insert MS Office data" tab must be selected) and click the "Finish" button.
  2. Before optimizing the code, choose "Save As ..." in Word and set the file type to "Web Page, Filtered". Then open the saved file in a text editor, copy the code, paste it into the field below (the "Insert HTML" tab must be selected) and click the "Finish" button.

As a result, you will get pristine HTML code.
The following attributes are left untouched:

"colspan", "rowspan", "href", "src", "type", "value", "lang", "tabindex", "title", "code", "alt", "target", "dir", "span", "action", "method"
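The cleaning rules described above (keep only the listed attributes, close unpaired tags with a slash, drop empty tags) can be sketched on plain strings. This is only an illustration under assumptions, not the service's actual code: `cleanTags`, `KEEP_ATTRS` and `VOID_TAGS` are hypothetical names, the list of unpaired tags is my own, and the regex-based tag matching assumes that attribute values contain no ">" character.

```javascript
// Attributes that survive cleaning (the list given in the article).
const KEEP_ATTRS = new Set(["colspan", "rowspan", "href", "src", "type", "value",
  "lang", "tabindex", "title", "code", "alt", "target", "dir", "span", "action", "method"]);

// Unpaired (void) tags that get a closing slash. This list is an assumption.
const VOID_TAGS = new Set(["br", "hr", "img", "input", "meta", "link", "col"]);

function cleanTags(html) {
  // Step 1: rebuild every tag, copying only the whitelisted attributes.
  let result = html.replace(/<(\/?)([a-zA-Z][\w:-]*)([^>]*)>/g, (match, slash, name, rest) => {
    const tag = name.toLowerCase();
    if (slash) return `</${tag}>`;
    let attrs = "";
    const attrPattern = /([\w:-]+)\s*=\s*("[^"]*"|'[^']*'|[^\s"']+)/g;
    for (const [, attr, value] of rest.matchAll(attrPattern)) {
      if (KEEP_ATTRS.has(attr.toLowerCase())) {
        const quoted = /^["']/.test(value) ? value : `"${value}"`;
        attrs += ` ${attr.toLowerCase()}=${quoted}`;
      }
    }
    return VOID_TAGS.has(tag) ? `<${tag}${attrs} />` : `<${tag}${attrs}>`;
  });

  // Step 2: repeatedly remove attribute-less empty pairs such as <span></span>.
  // Table cells are skipped so that empty cells do not shift columns.
  let previous;
  do {
    previous = result;
    result = result.replace(/<(?!td\b|th\b)([a-z][\w:-]*)>\s*<\/\1>/g, "");
  } while (result !== previous);
  return result;
}

console.log(cleanTags(
  '<p class="MsoNormal" style="margin:0"><a href="http://example.com" class="x">link</a><br><span style="color:red"></span></p>'
));
// <p><a href="http://example.com">link</a><br /></p>
```

The loop in step 2 matters because removing an inner empty tag can leave its parent empty in turn.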

Good day, dear readers! I hope things are as good for you as they are here: the sun is shining, the birds are singing, it's warm and summer has come! I am busy with my dissertation at the moment, so for the last month and a half I have been writing only once a week; I simply do not have the time. But let's not talk about sad things, let's get down to business!

Some time ago I searched the Internet for a script that cleans HTML code of the garbage left behind, in particular, by our "dearly beloved" Microsoft Word. I had previously cleaned up code with Adobe Dreamweaver, but it had two disadvantages:

    Sometimes it does not remove everything we would like it to.

    With a very large amount of code, the cleanup script throws an error.

The second point became critical for me, since I had to work with large HTML tables that one site could not do without, and all the information for them was supplied in Word.

So, after wandering the Internet for a long time, I found a script that handles all of this splendidly and is fully customizable as well.

Excel / Word to HTML is the perfect tool for editing the source code of articles for WordPress or any other content management system when the built-in editor doesn't provide all the functionality you need. Compose content right in your browser window without installing an extension or plugin for syntax highlighting and other text editing functions.

How to use?

Paste the document you want to convert into the Word editor, then switch to the HTML view using the large tabs at the top of the page to generate the code.

Clean up the messy markup with a big button that executes the active (checked) parameters in the list. You can also apply these functions one by one using the CLEAN icon.

Conversion problems easily solved by our online HTML converter

The problem of converting Word to HTML has probably existed for as long as Microsoft Word itself. The sheer number of styles assigned to the text, like mso-spacerun: yes, classes like MsoNormal, and a jumble of spans such as span style="font-size: 10.0pt" clutter up the code, and they often override the native styles defined on the site. Plain text can still be handled by pasting it through the editor's "Insert Text Only" button, but that method does not work for tables. Our converter can clean unnecessary comments and styles out of the future HTML file with a few clicks.
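To make the clutter concrete, here is a small sketch that removes only the Word-specific parts: class attributes consisting of a single Mso class, and mso-* declarations inside double-quoted style attributes, while other declarations are kept. `stripWordStyling` is a hypothetical name, and this is not how the converter itself is implemented.

```javascript
function stripWordStyling(html) {
  return html
    // Drop class attributes whose whole value is one Mso* class (MsoNormal, ...)
    .replace(/\s+class=(["']?)Mso\w*\1/gi, "")
    // Filter every double-quoted style attribute declaration by declaration
    .replace(/\s+style="([^"]*)"/gi, (match, css) => {
      const kept = css
        .split(";")
        .map((decl) => decl.trim())
        .filter((decl) => decl !== "" && !/^mso-/i.test(decl));
      // Remove the attribute entirely when nothing is left
      return kept.length ? ` style="${kept.join("; ")}"` : "";
    });
}

console.log(stripWordStyling('<p class=MsoNormal style="mso-spacerun: yes; color: red">x</p>'));
// <p style="color: red">x</p>
console.log(stripWordStyling('<span class="MsoNormal" style="mso-bidi-font-size:10.0pt">y</span>'));
// <span>y</span>
```

Splitting on ";" is a simplification: a declaration whose value itself contains a semicolon would be cut in two.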


Online HTML cleaning from unnecessary CSS styles
  • Remove any unnecessary styles from all text or selection
  • Remove unnecessary codes and characters, such as Unicode codes
  • Clean code from unnecessary spaces and duplicate tags
  • If required, remove the HTML markup completely.

Converting Word, Excel, and TXT files into pure HTML source code, without unnecessary styles and comments, ready to be inserted directly into site pages.

Supported formats for online conversion:

  • DOC (97-2004 and newer) to HTML, DOCX to HTML;
  • XLS to HTML, XLSX to HTML;
  • PPT to HTML, PPTX to HTML;
  • TXT to HTML and many other formats.

Another handy use of the service: instead of spending hours building a table in HTML by hand, make it in 15 minutes in Excel or Word and convert it into clean, tidy HTML code for embedding in the site.
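For simple tables the same effect can be approximated in a few lines. The sketch below assumes that cells copied from Excel arrive as plain text with tabs between columns and line breaks between rows; `tsvToHtmlTable` and `escapeHtml` are hypothetical names, and the service's own conversion may work quite differently.

```javascript
// Escape the characters that would otherwise be read as markup.
function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Turn tab-separated text into a bare HTML table with no styles or classes.
function tsvToHtmlTable(tsv) {
  const rows = tsv
    .split(/\r?\n/)
    .filter((line) => line !== "")
    .map((line) => line.split("\t"));
  const body = rows
    .map((cells) => "  <tr>" + cells.map((cell) => `<td>${escapeHtml(cell)}</td>`).join("") + "</tr>")
    .join("\n");
  return `<table>\n${body}\n</table>`;
}

console.log(tsvToHtmlTable("Name\tQty\nApple\t3"));
// <table>
//   <tr><td>Name</td><td>Qty</td></tr>
//   <tr><td>Apple</td><td>3</td></tr>
// </table>
```

Merged cells and cells that contain line breaks are not handled by this sketch.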

Get rid of your dirty markup with the free online HTML Cleaner. It's very easy to compose, edit, format and minify web code with this online tool. Convert Word docs and other visual documents such as Excel, PDF or Google Docs to tidy HTML. It's extremely simple and efficient to work with the paired visual and source editors, which respond instantly to your actions.

HTML Cleaner is equipped with many useful features to make HTML cleaning and editing as easy as possible. Just paste your code in the text area, set up the cleaning preferences and press the Clean HTML button. It can handle any document created with Microsoft Excel, PowerPoint, Google Docs or any other composer. It helps you easily get rid of all inline styles and unnecessary code added by Microsoft Word or other WYSIWYG editors. This HTML editor tool is useful when you're migrating content from one website to another and want to clean up all the alien classes and IDs the source site applies. Use the find and replace tool for your custom commands. The gibberish text generator lets you easily add dummy text to the editor.

At the top of the page you can see the visual editor and the source code editor next to each other. Whichever one you modify, the changes are reflected in the other in real time. The visual HTML editor allows beginners to compose their content just as in any word processor, while on the right the source editor with highlighted markup helps advanced users adjust the code. This makes the online program a nice tool for learning HTML coding.

Convert Word Documents To Clean HTML

To publish PDFs, Microsoft Word, Excel, PowerPoint or any other documents composed in different word processors online, or just to reuse content copied from another website, paste the formatted content into the visual editor. The HTML source of the document will immediately be visible in the source editor as well. The control bar above the WYSIWYG editor controls this field, while all other source cleaning settings are for editing the source code. Click the Clean HTML button after setting up the cleaning preferences. Copy the cleaned code and publish it on your website.

There's no guarantee that the program corrects all errors in your code exactly the way you want, so please try to enter syntactically valid HTML.

Convert HTML tables to structured div elements by activating the corresponding checkbox.


In the past, web designers built their websites using tables to organize the page layout, but in the era of responsive web design tables are outdated and DIVs are taking their place. This online tool helps you turn your tables into structured div elements with a few simple clicks.
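What such a table-to-div conversion could look like is sketched below. It is only an illustration under assumptions: the function `tablesToDivs` and the class names `div-table`, `div-row`, `div-cell` and `div-header` are invented for the example and are not the classes the tool actually emits.

```javascript
// Map table elements to the classes of the replacement divs (hypothetical names).
const DIV_CLASS = {
  table: "div-table",
  tr: "div-row",
  th: "div-cell div-header",
  td: "div-cell",
};

function tablesToDivs(html) {
  return html.replace(/<(\/?)(table|thead|tbody|tfoot|tr|th|td)\b[^>]*>/gi, (match, slash, name) => {
    const cls = DIV_CLASS[name.toLowerCase()];
    if (!cls) return ""; // thead, tbody and tfoot wrappers are simply dropped
    return slash ? "</div>" : `<div class="${cls}">`;
  });
}

console.log(tablesToDivs("<table><tbody><tr><th>A</th><td>b</td></tr></tbody></table>"));
// <div class="div-table"><div class="div-row"><div class="div-cell div-header">A</div><div class="div-cell">b</div></div></div>
```

The result still needs CSS (for example display: table, table-row and table-cell, or a flex layout) to look like a grid, and attributes such as colspan are discarded by this sketch.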

You can make your source code more readable by organizing the tabs hierarchy in a tree view.

Become a member

This website is a fully functional tool for cleaning and composing HTML code, but you can purchase an HTML G membership to access even more professional features. By using the free version of the HTML Cleaner you consent to the inclusion of links in the edited documents. The cleanup tool might add a promotional third-party link to the end of the cleaned documents, and you need to leave this code unchanged as long as you use the free version.