• I hope someone can help me. I am at my wits’ end.

    I am looking for a WYSIWYG editor that produces standards-compliant code, particularly when copying and pasting from MS Word. I’ve just spent the last 7 hours looking for a solution, to no avail.

    Before you go on a tangent and rant against WYSIWYG (as so many people seem to do when someone is asking WYSIWYG questions), I ought to point out that there are situations where it is necessary to use them. Myself, I have been hand-coding HTML in Notepad since the days when Lynx was pretty much the only web browser option (yes, I’m that old). But the site will be used as a CMS by people who are not HTML-savvy and who have much better things to do than learn HTML.

    I have downloaded and tried several plugins:
    – Chenpress
    – WYSI-Wordpress
    – Xinha4WP

    I had a look at WYSIWYG Pro.

    I had a play with X-Valid.

    Obviously, I tried the built-in TinyMCE.

    None of these allow you to cut/paste from MSWord and get clean HTML, free of extra classes and spans and formatting.

    The only editor that seems to handle that properly is XStandard. But I can’t seem to find anyone who’s done an XStandard integration into WordPress. I’ve spent quite a bit of time on their site, and I just don’t see how to do such an integration.

    Surely I’m not the only one who has such a need? It’s not asking for much: I just want to be able to cut/paste from Word into the editor and not have cr*p code show up.

    Has anyone found a solution?

    //Edit: Saving a Word doc to text and then copy-pasting from Notepad into WP is not a desirable solution.

Viewing 10 replies - 16 through 25 (of 25 total)
  • Hi all, I loved reading this discussion, since I’ve been having the exact same problem vavroom posted here, and for many years had no real solution to it.
    I believe I finally found a good solution. It involves some client-side scripting, but it solves ALL my problems.
    The approach is quite simple: I’m using a DIV with contentEditable set to true, then catching the onpaste event (which takes care of all pastes: Edit menu, right click and Ctrl+V) and the ondrop event (which takes care of dragging content into my rich text editor area).
    When a paste or drop occurs, I go over the entire HTML in the editor area and simply remove all unneeded tags, styles and classes, resulting in clean HTML code.
    This does not require the user to do a thing: it is a straight copy and paste from any rich HTML application (Word, Excel, another web page). I also disabled the undo feature (by handling Ctrl+Z), making sure the user cannot undo the automatic HTML formatting:

    <html>
    <head>
    <title></title>
    </head>
    <body>

    <div
      style="width:300px;height:100px;
             overflow: auto;
             border: thin inset;
             font-weight: normal"
      id="TEXTEDITOR1"
      onpaste="onPasteHandler();"
      ondrop="onPasteHandler();"></div>

    <span
      id="TEXTEDITOR1Message"
      style="color:red"></span>

    <script>

    var editor = document.getElementById('TEXTEDITOR1');
    var message = document.getElementById('TEXTEDITOR1Message');

    // Re-parse the editor contents, clean every node, and write the result back.
    function handleImportsTEXTEDITOR1() {
      var newHTML = document.createElement('span');
      newHTML.innerHTML = editor.innerHTML;
      newHTML = processNode(newHTML);

      editor.innerHTML = newHTML.innerHTML;
      message.textContent = 'Note: Some of the content you just pasted has been modified to conform to our web standards';
      window.setTimeout(function () { message.textContent = ''; }, 8000);
    }

    // Clean a node's children first, then the node itself. Returns the node,
    // or a plain-text SPAN that replaces it when the tag is not allowed.
    function processNode(obj) {
      for (var i = 0; i < obj.childNodes.length; i++) {
        var child = obj.childNodes[i];
        var nObj = processNode(child);
        if (nObj !== child) obj.replaceChild(nObj, child);
      }

      if (!validTag(obj)) {
        // Disallowed element: keep only its text. Comments and other
        // non-element nodes contribute nothing.
        var newNode = document.createElement('span');
        newNode.textContent = (obj.nodeType === 1) ? obj.textContent : '';
        return newNode;
      }

      if (obj.nodeType === 1) {
        // Element node: remove classes and inline styles
        obj.removeAttribute('class');
        obj.removeAttribute('style');
      }
      return obj;
    }

    // Cleanup list, all tags not listed here will be removed!!
    var ALLOWED_TAGS = ['#TEXT', 'BR', 'FONT', 'B', 'STRONG', 'SPAN', 'I',
                        'BLOCKQUOTE', 'A', 'DIV', 'P', 'OL', 'UL', 'LI'];

    function validTag(node) {
      return ALLOWED_TAGS.indexOf(node.nodeName.toUpperCase()) !== -1;
    }

    function onPasteHandler() {
      // Wait until the browser has inserted the pasted content, then clean it
      setTimeout(function () {
        handleImportsTEXTEDITOR1();
      }, 1); // 1ms should be enough
    }

    function myKeyHandler(event) {
      // CTRL+Z pressed: disable undo so the cleanup cannot be reverted
      if (event.ctrlKey && event.keyCode === 90) {
        event.preventDefault();
      }
    }

    editor.contentEditable = 'true';
    editor.innerHTML = '<font face="Arial" size="2">Preload your HTML content here</font>';
    document.body.style.margin = '0px';
    // The original snippet also assigned an onkeyup handler named
    // handleChangeTEXTEDITOR1, which was never defined; it is left out here.
    document.addEventListener('keydown', myKeyHandler);

    </script>

    </body>
    </html>

    Let me know if this works for you.
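
    For anyone who wants to try the whitelist logic outside a browser, here is a minimal sketch of the same walk over a plain object tree. The node shape (a string for text, `{ tag, attrs, children }` for an element) and the names `clean`, `textOf` and `ALLOWED` are made up for illustration and are not a DOM API; the allowed list is the one from the post above.

```javascript
// Hypothetical node shape for illustration (not a DOM API):
// a text node is a string; an element is { tag, attrs, children }.
const ALLOWED = new Set(['BR', 'FONT', 'B', 'STRONG', 'SPAN', 'I', 'BLOCKQUOTE',
                         'A', 'DIV', 'P', 'OL', 'UL', 'LI']);

// Concatenate all text beneath a node.
function textOf(node) {
  if (typeof node === 'string') return node;
  return node.children.map(textOf).join('');
}

// Return a cleaned copy: a disallowed element collapses to a SPAN holding
// only its text; an allowed element loses its class and style attributes.
function clean(node) {
  if (typeof node === 'string') return node;
  if (!ALLOWED.has(node.tag.toUpperCase())) {
    return { tag: 'SPAN', attrs: {}, children: [textOf(node)] };
  }
  const attrs = Object.assign({}, node.attrs);
  delete attrs.class;
  delete attrs.style;
  return { tag: node.tag, attrs: attrs, children: node.children.map(clean) };
}

// Simplified stand-in for a Word paste:
// <p class="MsoNormal" style="margin:0cm"><o:p>Hi</o:p></p>
const pasted = {
  tag: 'P',
  attrs: { class: 'MsoNormal', style: 'margin:0cm' },
  children: [{ tag: 'O:P', attrs: {}, children: ['Hi'] }],
};

console.log(JSON.stringify(clean(pasted)));
```

    The sketch returns a copy instead of editing nodes in place, which makes it easy to compare the pasted input with the cleaned result.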

    Client-side filtering is a bad idea for anything serious and should not be trusted. If you want to get rid of the MsoNormals Microsoft Word is so fond of, be my guest, but realize that anything done in JavaScript can (easily) be circumvented.

    Client-side scripting, and specifically JavaScript, has been around since the mid-1990s (it first shipped in Netscape Navigator 2). JavaScript and client-side scripting had evolved, already 5-6 years ago, to the point that you can code a client-side application (using only a scripting language and HTML) that will rival the most sophisticated Windows applications.

    I believe it is time to let go of the old notion “Client-side is bad idea for any serious coding”. It is a mature and reliable platform, as are the web browsers you are dealing with (IE/Firefox), and it has been for a while.

    If the argument is about the ability to turn scripting off in your browser: true, but when it comes down to it, no site today will display properly if you turn scripting (or even cookies) off, plus the average user does not know how to turn scripting off, so it’s really not an argument.
    If the argument is about JavaScript errors: those can be easily caught and handled, and any programmer knows he needs to catch possible errors, especially in the web environment.

    Regarding the solution I offered, it is the only solution I’ve seen that actually solves the problem (or even comes close), and it can be easily modified to your own company’s needs, whether that is blocking specific HTML tags or removing unnecessary styling. This does the trick!!
    And all of it is a few lines of code: no need for downloads, no need to educate the user. It is seamless.

    Love client side, don’t hate … (for no real reason, that is)

    You’re dead on. Yes: JavaScript is here to stay, and you’d be a fool not to use it. Yes: JavaScript is a fully featured programming language: Mozilla Firefox is practically built on JavaScript. Yes: the average user does not turn off scripting.

    But there’s one thing that no amount of client-side scripting can fully replace: filtering incoming data on the server. “Client-side is bad idea for any serious coding” does not equal “Do not trust data that comes from the client.” The former is false, the latter true. Because JavaScript can be turned off, *any* client-side security checks (for example, removing undesirable tags and attributes) can easily be circumvented.

    Let’s give an example. First the normal use case:

    1. Bob wants to post an MS Word document. He copies and pastes it into the text editor.
    2. JavaScript (client side) transparently cleans up the formatting for him.
    3. Bob presses submit, it gets sent to the server, which DOESN’T do any other checking, and puts it on the result page.

    How to abuse:

    1. Mallory surfs to the web page and turns off JavaScript. She fills in the web form with malicious, raw HTML.
    2. The data gets sent to the server. Since the server doesn’t do any checking, XSS and other meanies get onto the HTML page.

    JavaScript is great for thwarting good-faith incompetency/blundering, but against a determined attacker it is no good. You must implement server-side filtering with something like HTML Purifier.
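
    To make the abuse case concrete, here is a minimal sketch of the crudest possible server-side defense: escape everything the client sends, whether or not the client-side cleanup ran. This is not HTML Purifier (which is a PHP library and keeps safe markup instead of escaping it all); `escapeHtml` and `handleSubmit` are hypothetical names used only for illustration.

```javascript
// Hypothetical server-side fallback: escape every HTML-significant character
// so submitted markup is displayed as text instead of being interpreted.
function escapeHtml(untrusted) {
  return String(untrusted)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical submit handler: it never assumes the browser cleaned anything.
function handleSubmit(formBody) {
  return '<div class="post">' + escapeHtml(formBody) + '</div>';
}

// Mallory's payload, posted with JavaScript turned off
const payload = '<script>alert("xss")</script>';
console.log(handleSubmit(payload));
```

    Escaping everything also throws away the formatting Bob wanted to keep, which is why a real site would filter against a whitelist on the server instead.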

    P.S. Theoretically speaking, websites should degrade gracefully: when JavaScript is turned off, they should still function, albeit without any of the client-side flashiness/polish. Alas, many websites do not, though most still do. Personally, I use NoScript to block scripting on all sites I visit, and then enable scripting on a case-by-case basis.

    Are you involved with HTML Purifier?

    Yes. I’m quite proud of the library.

    Don’t get me wrong: it’s really frustrating seeing people constantly botching HTML filtering. It’s a *hard* problem to solve. You can read this comparison for more info.

    S’okay with me, I just figured you were; but I do think you should have so stated to begin with. I’m fully aware of the problems, and your library is one I’ve looked at myself for use – and had I not been in the middle of boot drive problems, would have downloaded before now.

    Sorry about that, I’ll be sure to make it clear in the future.

    “I’ve just spent the last 7 hours looking for a solution, to no avail.”

    Don’t despair, these things can take much more than seven hours.

    “Before you go on a tangent and rant against WYSIWYG (as so many people seem to do when someone is asking WYSIWYG questions), I ought to point out that there are situations where it is necessary to use them.”

    I won’t go on a tangent, but I have recommended the excellent Markdown Extra for quite a while now and most of my clients love it. While it takes a 15-minute investment to learn, it’s very intuitive and the (simple) rules are hard to forget. I too have had some users who initially resisted switching from Word, but they got used to Markdown very, very quickly and now won’t go back. I strongly suggest you offer this as an alternative.

    We have just launched version 1.0 of blog.dot. This template and its associated DLL produce a clean HTML export from Microsoft Word. If you install the MySQL ODBC driver you can post directly to your WordPress blog from MS Word.

    Download version 1.0.

  • The topic ‘Clean HTML from MS Word.’ is closed to new replies.