• nosaint

    (@nosaint)


    The plugin is great but I’m having a huge problem with it.
    It breaks words apart when it encounters special (language-specific) characters. Is there any way to make it accept special characters and read the entire word instead of breaking it apart?

  • chejo59

    (@chejo59)

    Same here.

    I’ve tried to add a new noise word list in Spanish, but when I save the options a message says: “The noise words you entered are in an invalid format.”

    Thanks in advance!

    Plugin Author strictly-software

    (@strictly-software)

    Hi, thanks for commenting.

    This plugin was originally created for English language based blogs only.

    There are some known issues that occur if you try to parse UTF-8 characters or content written in another character set, including:
    - Non-ASCII characters can cause a match to stop early, so words get cut short.
    - Capital UTF-8 characters are not recognised as capitals, so acronym and name detection doesn’t work.
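
    A minimal sketch of the second issue, assuming the content is UTF-8 encoded. The sample word is illustrative and not taken from the plugin. An ASCII range such as `[A-Z]` cannot see an accented capital, while the Unicode property `\p{Lu}` combined with the `u` modifier can:

```php
<?php
// "ÉCOLE" written with byte escapes so the example does not depend on
// how this file itself is saved (É is C3 89 in UTF-8).
$word = "\xC3\x89COLE";

// ASCII-only class: the match can only start after the accented capital.
preg_match('/[A-Z]{2,}/', $word, $ascii);

// Unicode uppercase class with the u modifier: the whole word matches.
preg_match('/\p{Lu}{2,}/u', $word, $unicode);

echo $ascii[0], "\n";   // COLE
echo $unicode[0], "\n"; // ÉCOLE
```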

    The main reason it is targeted at English content only is that it is impossible for me to write regular expression logic that could automatically detect people’s names and important words in every possible language (Chinese, Russian, Indian etc) and character set. As an English speaker I know the logic that can be used to detect important words in my language; however, non-ASCII based languages don’t have the same grammatical constructs.

    However, when I first came across this issue I did spend some time changing all the regular expressions to use the Unicode character classes to see if I could resolve some of these issues, and I did seem to get it working on a test page on my local PC. When I tried the same code on a live site with a WordPress posted article, though, it just didn’t behave in the same way and I was unable to get to the bottom of why.

    Using an example article content of

    <p><span>??è Exemple : name to ??? tag – Patrick Lagacé Québec the tag is named Patrick Lagacé – name to tag ?é? èep – Québec the tag is named Québec</span></p>

    which contains a number of capital UTF-8 characters, my test version returns matches for

    Patrick Lagacé
    Québec
    ??è
    ???
    Patrick Lagacé
    ??è Exemple
    Patrick Lagacé Québec
    ?é? èep Québec

    Obviously these are just meaningless examples, but they prove that my code can match acronyms and names containing UTF-8. However, when I copy the code to a WordPress site and enter the same content as an article, the only match I get is

    Patrick Lagac

    I don’t have enough time to debug WordPress to find out what is causing the problem and the plugin was initially designed for the English language only. Without someone paying for custom development I doubt this issue will get resolved any time soon as I have to work full time as well as run a number of sites out of hours.

    Feel free to edit the source code yourself and change the regular expressions to see if you can get it working. If you can, let me know so that I can incorporate it into the code.

    To help you get started, here are versions of two functions I changed. They work when I run a test page (using the exact same class code) on my local PC but don’t seem to on WordPress.

    protected function MatchAcronyms($content,&$searchtags){
    
    	// easiest way to look for keywords without some sort of list is to look for Acronyms like CIA, AIG, JAVA etc.
    	// so use a regex to match all words that are pure capitals 2 chars or more to skip over I A etc
    	//@preg_match_all("/\b([A-Z]{2,})\b/u",$content,$matches,PREG_SET_ORDER);
    
    	// This version handles UTF8
    	@preg_match_all("/\b(\p{Lu}{2,})\b/u",utf8_encode($content),$matches,PREG_SET_ORDER);
    
    	if($matches){
    
    		foreach($matches as $match){
    
    			$pat = utf8_decode($match[1]);
    
    			// ignore noise words who someone has capitalised as well as roman numerals which may be part of something else e.g World War II
    			if(!$this->IsNoiseWord($pat) && !$this->IsRomanNumeral($pat)){
    				// add in the format key=value to make removing items easy and quick plus we don't waste overhead running
    				// array_unique to remove duplicates!
    				$searchtags[$pat] = trim($pat);
    			}
    		}
    	}
    
    	unset($match,$matches);
    
    }
    
    protected function MatchNames($content,&$searchtags){
    
    	// look for names of people or important strings of 2+ words that start with capitals e.g Federal Reserve Bank or Barack Hussein Obama
    	// this is not perfect and will not handle Irish type surnames O'Hara etc
    	//@preg_match_all("/((\b[A-Z][^A-Z\s\.,;:]+)(\s+[A-Z][^A-Z\s\.,;:]+)+\b)/u",$content,$matches,PREG_SET_ORDER);
    
    	// This version handles UTF8
    	@preg_match_all("/((\b\p{Lu}(?:\p{Ll}|[^\s\.,;:])+)(\s+\p{Lu}(?:\p{Ll}|[^\s\.,;:])+)+\b)/u",utf8_encode($content),$matches,PREG_SET_ORDER);
    
    	// found some results
    	if($matches){
    
    		foreach($matches as $match){
    
    			$pat = utf8_decode($match[1]);
    
    			$searchtags[$pat] = trim($pat);
    
    		}
    	}
    
    	unset($match,$matches);
    }
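
    As a hedged sketch only (not the plugin's code), the same acronym match can be written so that the input is converted to UTF-8 at most once. The function names `latin1_to_utf8` and `find_acronyms_utf8` are hypothetical, the noise word and Roman numeral checks are left out for brevity, and the "is it already UTF-8?" test is a heuristic that treats any other input as ISO-8859-1:

```php
<?php
// Pure-PHP equivalent of utf8_encode(): reads each byte as ISO-8859-1
// and re-encodes it as UTF-8.
function latin1_to_utf8($s)
{
    return preg_replace_callback('/[\x80-\xFF]/', function ($m) {
        $c = ord($m[0]);
        return chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    }, $s);
}

// Hypothetical standalone variant of MatchAcronyms.
function find_acronyms_utf8($content)
{
    // An empty pattern with the u modifier returns 1 only when the subject
    // is valid UTF-8, so it doubles as an encoding check.
    if (preg_match('//u', $content) !== 1) {
        $content = latin1_to_utf8($content);
    }

    $found = array();

    // Lookarounds stand in for \b so the boundary test is Unicode-aware.
    if (preg_match_all('/(?<!\p{L})\p{Lu}{2,}(?!\p{L})/u', $content, $matches)) {
        foreach ($matches[0] as $acronym) {
            // Keyed by value so duplicates collapse, as in the original.
            $found[$acronym] = $acronym;
        }
    }

    return array_values($found);
}

// "CIA et ÉNA à Québec" as UTF-8 bytes: finds CIA and ÉNA, not Québec.
print_r(find_acronyms_utf8("CIA et \xC3\x89NA \xC3\xA0 Qu\xC3\xA9bec"));
```

    The results stay in UTF-8 rather than being passed back through `utf8_decode`, on the assumption that the site stores its tags as UTF-8.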

    Thanks

    Thank you very much for your response. You’ve done a great job with your plugin and we appreciate all the effort you have made to make it work.

    I have some questions:
    As I’m not an expert in coding and in the example there are two versions of the code, which one should I use?
    Do the examples replace all the original code or only a portion of the code?

    Thread Starter nosaint

    (@nosaint)

    Those are 2 functions.
    Open strictlyautotags.class.php and search for

    MatchAcronyms

    (about line 342). Select the function, between line 342 and around line 368, where you see } before

    /**
    * Searches the passed in content looking for Countries to add to the search tags array

    , delete it and paste

    protected function MatchAcronyms($content,&$searchtags){
    
    	// easiest way to look for keywords without some sort of list is to look for Acronyms like CIA, AIG, JAVA etc.
    	// so use a regex to match all words that are pure capitals 2 chars or more to skip over I A etc
    	//@preg_match_all("/\b([A-Z]{2,})\b/u",$content,$matches,PREG_SET_ORDER);
    
    	// This version handles UTF8
    	@preg_match_all("/\b(\p{Lu}{2,})\b/u",utf8_encode($content),$matches,PREG_SET_ORDER);
    
    	if($matches){
    
    		foreach($matches as $match){
    
    			$pat = utf8_decode($match[1]);
    
    			// ignore noise words who someone has capitalised as well as roman numerals which may be part of something else e.g World War II
    			if(!$this->IsNoiseWord($pat) && !$this->IsRomanNumeral($pat)){
    				// add in the format key=value to make removing items easy and quick plus we don't waste overhead running
    				// array_unique to remove duplicates!
    				$searchtags[$pat] = trim($pat);
    			}
    		}
    	}
    
    	unset($match,$matches);
    
    }

    then search for

    MatchNames

    (around line 400) select the function (up to 422) including the } before

    `/**
    * check the content to see if the amount of content that is parsable is above the allowed threshold`

    delete it and paste

    protected function MatchNames($content,&$searchtags){
    
    	// look for names of people or important strings of 2+ words that start with capitals e.g Federal Reserve Bank or Barack Hussein Obama
    	// this is not perfect and will not handle Irish type surnames O'Hara etc
    	//@preg_match_all("/((\b[A-Z][^A-Z\s\.,;:]+)(\s+[A-Z][^A-Z\s\.,;:]+)+\b)/u",$content,$matches,PREG_SET_ORDER);
    
    	// This version handles UTF8
    	@preg_match_all("/((\b\p{Lu}(?:\p{Ll}|[^\s\.,;:])+)(\s+\p{Lu}(?:\p{Ll}|[^\s\.,;:])+)+\b)/u",utf8_encode($content),$matches,PREG_SET_ORDER);
    
    	// found some results
    	if($matches){
    
    		foreach($matches as $match){
    
    			$pat = utf8_decode($match[1]);
    
    			$searchtags[$pat] = trim($pat);
    
    		}
    	}
    
    	unset($match,$matches);
    }

    For me those functions didn’t change anything: same problem with words being broken up. Besides, it seems Strictly Auto Tags is not taking the words in the blacklist into account… but that’s another story.

    Plugin Author strictly-software

    (@strictly-software)

    As I said, those functions only worked on a local test page that I ran from WAMP on my PC (see code at bottom of post).

    When I tried to update the plugin code on WordPress it didn’t make any difference, so I suspect WordPress is doing some decoding / encoding of its own somewhere along the line, which is causing the problem.
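
    One plausible explanation, offered as a sketch rather than a confirmed diagnosis: if the post content reaching the plugin is already UTF-8 (the WordPress default), then calling `utf8_encode` on it encodes it a second time, and the cleanup pattern from `FormatContent` strips half of each accented character. The helper `latin1_to_utf8` below is a hypothetical pure-PHP stand-in for `utf8_encode`, which is deprecated as of PHP 8.2:

```php
<?php
// Pure-PHP equivalent of utf8_encode(): reads every byte as ISO-8859-1
// and re-encodes it as UTF-8.
function latin1_to_utf8($s)
{
    return preg_replace_callback('/[\x80-\xFF]/', function ($m) {
        $c = ord($m[0]);
        return chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    }, $s);
}

// "Lagacé" as it would arrive if the post content is already UTF-8.
$name = "Lagac\xC3\xA9";

// Encoding it a second time turns the two bytes of é into two characters.
$double = latin1_to_utf8($name);
echo bin2hex($name), "\n";   // 4c61676163c3a9
echo bin2hex($double), "\n"; // 4c61676163c383c2a9

// The cleanup pattern from FormatContent then removes the second character
// (U+00A9), so the bytes of the original é can no longer be reassembled.
$cleaned = preg_replace('/[^\w\d\s\.,]/u', ' ', $double);
var_dump(strpos($cleaned, "\xC2\xA9")); // bool(false)
```

    That would be consistent with "Patrick Lagacé" coming back as "Patrick Lagac" on WordPress while a test page fed ISO-8859-1 input works.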

    Here is the test page I created, which I ran on my local PC through WAMPServer. As you can see (or should, as I do), the final array of collected names, which is what gets passed to WordPress to be saved as tags, is correct in that all the UTF-8 characters are retained.

    E.g. the following test page, when run locally, returns this array

    Array ( [0] => Patrick Lagacé [1] => Québec [2] => ??è [3] => ??? [4] => Patrick Lagacé [5] => ??è Exemple [6] => Patrick Lagacé Québec [7] => ?é? èep Québec )

    So the answer lies in debugging WordPress’s own code to work out where the issue is.
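
    A small debugging aid for that kind of investigation, as a sketch: dumping the raw bytes of the content at each stage shows whether it is UTF-8, ISO-8859-1 or double-encoded. The helper name `describe_bytes` and its output format are hypothetical:

```php
<?php
// Hypothetical debugging helper; name and output format are illustrative.
function describe_bytes($label, $s)
{
    // An empty pattern with the u modifier matches only valid UTF-8 subjects.
    $valid = preg_match('//u', $s) === 1 ? 'valid UTF-8' : 'not valid UTF-8';
    return sprintf('%s: %s, %d bytes, hex %s', $label, $valid, strlen($s), bin2hex($s));
}

// "Québec" encoded as UTF-8 and as ISO-8859-1.
echo describe_bytes('utf8', "Qu\xC3\xA9bec"), "\n";
echo describe_bytes('latin1', "Qu\xE9bec"), "\n";
```

    Called on the post content before and after each `utf8_encode` / `utf8_decode` step, it would show where an é stops being the two bytes `c3a9`.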

    I don’t know how out of date the test code is compared to the actual plugin class code, but its only purpose was to replicate the actions of my plugin without having to load up WordPress code etc. The main functions that would need updating in the plugin are those I listed above.

    <?php
    error_reporting(E_ALL);
    
    if(!defined('DEBUGAUTOTAG')){
    	define('DEBUGAUTOTAG',true);
    }
    
    if(!defined('AUTOTAG_BOTH')){
    	define('AUTOTAG_BOTH',0);
    }
    if(!defined('AUTOTAG_SHORT')){
    	define('AUTOTAG_SHORT',1);
    }
    
    require_once(dirname(__FILE__) . "\\strictly-autotags\\trunk\\strictlyautotagfuncs.php");
    
    class StrictlyAutoTags{
    
       /**
    	* look for new tags by searching for Acronyms and names
    	*
    	* @access protected
    	* @var bool
    	*/
    	protected $autodiscover; 
    
       /**
    	* treat tags found in the post title as important and automatically add them to the post
    	*
    	* @access protected
    	* @var bool
    	*/
    	protected $ranktitle; 
    
       /**
    	* The maximum number of tags to add to a post
    	*
    	* @access protected
    	* @var integer
    	*/
    	protected $maxtags; 
    
    	/**
    	* The percentage of content that is allowed to be capitalised when auto discovering new tags
    	*
    	* @access protected
    	* @var integer
    	*/
    	protected $ignorepercentage;
    
    	/**
    	* The list of noise words to use
    	*
    	* @access protected
    	* @var string
    	*/
    	protected $noisewords;
    
    	/**
    	* This setting determines how nested tags are handled e.g New York, New York City, New York City Fire Department all contain "New York"
    	* AUTOTAG_BOTH = all 3 terms will be tagged
    	* AUTOTAG_SHORT= the shortest version "New York" will be tagged and the others discarded
    	* AUTOTAG_LONG = the longest version "New York City Fire Department" will be tagged and the others discarded
    	*/
    	protected $nestedtags;
    
    	/**
    	* The default list of noise words to use
    	*
    	* @access protected
    	* @var string
    	*/
    	protected $defaultnoisewords = "about|after|a|all|also|an|and|another|any|are|as|at|be|because|been|before|being|between|both|but|by|came|can|come|could|did|do|each|even|for|from|further|furthermore|get|got|had|has|have|he|her|here|hi|him|himself|how|however|i|ii|if|in|indeed|into|is|it|its|just|like|made|many|may|me|might|more|moreover|most|much|must|my|never|not|now|of|ok|on|only|or|other|our|out|over|put|said|same|see|she|should|since|some|still|such|take|than|that|the|their|them|then|there|therefore|these|they|this|those|through|thus|to|too|under|up|very|was|way|we|well|were|what|when|where|which|while|who|will|why|with|would|you|your|yes|no|today|yesterday|tomorrow"; 
    
    	/**
    	* Holds a regular expression for checking whether a word is a noise word
    	*
    	* @access protected
    	* @var string
    	*/
    	protected $isnoisewordregex;
    
    	/**
    	* Holds a regular expression for removing noise words from a string of words
    	*
    	* @access protected
    	* @var string
    	*/
    	protected $removenoisewordsregex;
    
    	public function __construct(){
    
    		// set up values for config options e.g autodiscover, ranktitle, maxtags
    		//$this->GetOptions();
    		$this->autodiscover		= true;
    
    		$this->ranktitle		= true;
    
    		$this->rankspecial		= true;
    
    		$this->maxtags			= 8;
    
    		$this->ignorepercentage	= 80;
    
    		$this->noisewords		= $this->defaultnoisewords;
    
    		$this->nestedtags		= AUTOTAG_BOTH;
    
    		// create some regular expressions required by the parser
    
    		// create regex to identify a noise word
    		$this->isnoisewordregex		= "/^(?:" . $this->noisewords . ")$/i";
    
    		// create regex to replace all noise words in a string
    		$this->removenoisewordsregex= "/\b(" . $this->noisewords . ")\b/i";
    
    		// load any language specific text
    		//load_textdomain('strictlyautotags', dirname(__FILE__).'/language/'.get_locale().'.mo');
    
    		// add options to admin menu
    		//add_action('admin_menu', array(&$this, 'RegisterAdminPage'));
    
    		// set a function to run whenever posts are saved that will call our AutoTag function
    		//add_actions( array('save_post', 'publish_post', 'post_syndicated_item'), array(&$this, 'SaveAutoTags') );
    
    	}
    
    	/**
    	 * Check post content for auto tags
    	 *
    	 * @param integer $post_id
    	 * @param array $post_data
    	 * @return boolean
    	 */
    	public function SaveAutoTags( $post_id = null, $post_data = null ) {
    		$object = get_post($post_id);
    		if ( $object == false || $object == null ) {
    			return false;
    		}
    
    		$posttags = $this->AutoTag( $object );
    
    		// add tags to post
    		// Append tags if tags to add
    		if ( count($posttags) > 0) {
    
    			// Add tags to posts
    			wp_set_object_terms( $object->ID, $posttags, 'post_tag', true );
    
    			// Clean cache
    			if ( 'page' == $object->post_type ) {
    				clean_page_cache($object->ID);
    			} else {
    				clean_post_cache($object->ID);
    			}
    		}
    
    		return true;
    	}
    
    	/**
    	 * Format content to make searching for new tags easier
    	 *
    	 * @param string $content
    	 * @return string
    	 */
    	protected function FormatContent($content=""){
    
    		ShowDebugAutoTag("IN FormatContent $content");
    
    		if(!empty($content)){
    
    			// if we are auto discovering tags then we need to reformat words next to full stops so that we don't get false positives
    			if($this->autodiscover){
    				// ensure capitals next to full stops are decapitalised but only if the word is single e.g
    				// change ". The world" to ". the" but not ". United States"
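    				// the /e modifier was removed in PHP 7, so a callback does the lowercasing instead
    				$content = preg_replace_callback("/(\.[”’\"]?\s*[A-Z][a-z]+\s[a-z])/",function($m){ return strtolower($m[1]); },$content);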
    			}
    
    			// remove plurals
    			$content = preg_replace("/(\w)([‘'’]s )/i","$1 ",$content);
    
    			ShowDebugAutoTag("REMOVE NON LETTERS OR NUMBERS");
    
    			// now remove anything not a letter or number
    			$content = utf8_decode( preg_replace("/[^\w\d\s\.,]/u"," ",utf8_encode($content)));
    
    			// replace new lines with a full stop so we don't get cases of two unrelated strings being matched
    			$content = preg_replace("/\r\n/",". ",$content);
    
    			// remove excess space
    			$content = preg_replace("/\s{2,}/"," ",$content);			
    
    		}
    
    		ShowDebugAutoTag("RETURN $content");
    
    		return $content;
    
    	}
    
    	/**
    	 * Checks a word to see if its a known noise word
    	 *
    	 * @param string $word
    	 * @return boolean
    	 */
    	protected function IsNoiseWord($word){
    
    		$count = preg_match($this->isnoisewordregex,$word,$match);
    
    		if(count($match)>0){
    			return true;
    		}else{
    			return false;
    		}
    	}
    
    	/**
    	 * Checks whether a word is a roman numeral
    	 *
    	 */
    	function IsRomanNumeral($word){
    
    		if(preg_match("/^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/",$word)){
    			return true;
    		}else{
    			return false;
    		}
    	}
    
    	/*
    	 * removes noise words from a given string
    	 *
    	 * @param string
    	 * @return string
    	 */
    	protected function RemoveNoiseWords($content){		
    
    		$content = preg_replace($this->removenoisewordsregex," ",$content);
    
    		return $content;
    	}
    
    	/*
    	 * counts the number of words that are capitalised in a string
    	 *
    	 * @param string
    	 * @return integer
    	 */
    	protected function CountCapitals($words){
    
    		$no_caps =	preg_match_all("/\b[A-Z][A-Za-z]*\b/",$words,$matches);			
    
    		return $no_caps;
    	}
    
    	/*
    	 * strips all non words from a string
    	 *
    	 * @param string
    	 * @return string
    	 */
    	protected function StripNonWords($words){
    
    		ShowDebugAutoTag("IN StripNonWords = " . $words);
    
    		// strip everything not space or uppercase/lowercase
    		$words = preg_replace("/[^A-Za-z\s]/","",$words);
    
    		ShowDebugAutoTag("NOW StripNonWords = " . $words);
    
    		return $words;
    	}
    
    	/**
    	 * Searches the passed in content looking for Acronyms to add to the search tags array
    	 *
    	 * @param string $content
    	 * @param array $searchtags
    	 */
    	protected function MatchAcronyms($content,&$searchtags){
    
    		ShowDebugAutoTag("IN MatchAcronyms");
    
    		// easiest way to look for keywords without some sort of list is to look for Acronyms like CIA, AIG, JAVA etc.
    		// so use a regex to match all words that are pure capitals 2 chars or more to skip over I A etc
    		//preg_match_all("/\b([A-Z]{2,})\b/u",$content,$matches,PREG_SET_ORDER);
    
    		preg_match_all("/\b(\p{Lu}{2,})\b/u",utf8_encode($content),$matches,PREG_SET_ORDER);
    
    		if($matches){
    
    			foreach($matches as $match){
    
    				$pat = utf8_decode($match[1]);
    
    				// ignore noise words who someone has capitalised!
    				if(!$this->IsNoiseWord($pat) && !$this->IsRomanNumeral($pat)){
    					// add in the format key=value to make removing items easy and quick plus we don't waste overhead running
    					// array_unique to remove duplicates!					
    
    					$searchtags[$pat] = trim($pat);
    
    					ShowDebugAutoTag("found possible Acronym ='" . trim($pat) . "'");
    				}
    			}
    		}
    
    		unset($match,$matches);
    
    	}
    
    	/**
    	 * Searches the passed in content looking for Countries to add to the search tags array
    	 *
    	 * @param string $content
    	 * @param array $searchtags
    	 */
    	protected function MatchCountries($content,&$searchtags){
    
    		ShowDebugAutoTag("IN MatchCountries");
    		preg_match_all("/\s(Afghanistan|Albania|Algeria|American\sSamoa|Andorra|Angola|Anguilla|Antarctica|Antigua\sand\sBarbuda|Arctic\sOcean|Argentina|Armenia|Aruba|Ashmore\sand\sCartier\sIslands|Australia|Austria|Azerbaijan|Bahrain|Baker\sIsland|Bangladesh|Barbados|Bassas\sda\sIndia|Belarus|Belgium|Belize|Benin|Bermuda|Bhutan|Bolivia|Bosnia\sand\sHerzegovina|Botswana|Bouvet\sIsland|Brazil|British\sVirgin\sIslands|Brunei|Bulgaria|Burkina\sFaso|Burma|Burundi|Cambodia|Cameroon|Canada|Cape\sVerde|Cayman\sIslands|Central\sAfrican\sRepublic|Chad|Chile|China|Christmas\sIsland|Clipperton\sIsland|Cocos\s(Keeling)\sIslands|Colombia|Comoros|Congo|Cook\sIslands|Coral\sSea\sIslands|Costa\sRica|Croatia|Cuba|Cyprus|Czech\sRepublic|Denmark|Djibouti|Dominica|Dominican\sRepublic|Ecuador|Eire|Egypt|El\sSalvador|Equatorial\sGuinea|England|Eritrea|Estonia|Ethiopia|Europa\sIsland|Falkland\sIslands\s|Islas\sMalvinas|Faroe\sIslands|Fiji|Finland|France|French\sGuiana|French\sPolynesia|French\sSouthern\sand\sAntarctic\sLands|Gabon|Gaza\sStrip|Georgia|Germany|Ghana|Gibraltar|Glorioso\sIslands|Greece|Greenland|Grenada|Guadeloupe|Guam|Guatemala|Guernsey|Guinea|Guinea-Bissau|Guyana|Haiti|Heard\sIsland\sand\sMcDonald\sIslands|Holy\sSee\s(Vatican\sCity)|Honduras|Hong\sKong|Howland\sIsland|Hungary|Iceland|India|Indonesia|Iran|Iraq|Ireland|Israel|Italy|Ivory\sCoast|Jamaica|Jan\sMayen|Japan|Jarvis\sIsland|Jersey|Johnston\sAtoll|Jordan|Juan\sde\sNova\sIsland|Kazakstan|Kenya|Kingman\sReef|Kiribati|Korea|Korea|Kuwait|Kyrgyzstan|Laos|Latvia|Lebanon|Lesotho|Liberia|Libya|Liechtenstein|Lithuania|Luxembourg|Macau|Macedonia\sThe\sFormer\sYugoslav\sRepublic\sof|Madagascar|Malawi|Malaysia|Maldives|Mali|Malta|Man\sIsle\sof|Marshall\sIslands|Martinique|Mauritania|Mauritius|Mayotte|Mexico|Micronesia\sFederated\sStates\sof|Midway\sIslands|Moldova|Monaco|Mongolia|Montenegro|Montserrat|Morocco|Mozambique|Namibia|Nauru|Navassa\sIsland|Nepal|Netherlands|Netherlands\sAntilles|New\sCaledonia|New\sZealand|Nicaragua|Nigeria|Niue|Norfolk\sIsland|Northern\sIreland|Northern\sMariana\sIslands|Norway|Oman|Pakistan|Palau|Palmyra\sAtoll|Panama|Papua\sNew\sGuinea|Paracel\sIslands|Paraguay|Peru|Philippines|Pitcairn\sIslands|Poland|Portugal|Puerto\sRico|Qatar|Reunion|Romania|Russia|Rwanda|Saint\sHelena|Saint\sKitts\sand\sNevis|Saint\sLucia|Saint\sPierre\sand\sMiquelon|Saint\sVincent\sand\sthe\sGrenadines|San\sMarino|Sao\sTome\sand\sPrincipe|Saudi\sArabia|Scotland|Senegal|Serbia|Seychelles|Sierra\sLeone|Singapore|Slovakia|Slovenia|Solomon\sIslands|Somalia|South\sAfrica|South\sGeorgia\sand\sthe\sSouth\sSandwich\sIslands|Spain|Spratly\sIslands|Sri\sLanka|Sudan|Suriname|Svalbard|Swaziland|Sweden|Switzerland|Syria|Taiwan|Tajikistan|Tanzania|Thailand|The\sBahamas|The\sGambia|Togo|Tokelau|Tonga|Trinidad\sand\sTobago|Tromelin\sIsland|Tunisia|Turkey|Turkmenistan|Turks\sand\sCaicos\sIslands|Tuvalu|Uganda|Ukraine|United\sArab\sEmirates|UAE|United\sKingdom|UK|United\sStates\sof\sAmerica|USA|Uruguay|Uzbekistan|Vanuatu|Venezuela|Vietnam|Virgin\sIslands|Wake\sIsland|Wales|Wallis\sand\sFutuna|West\sBank|Western\sSahara|Western\sSamoa|Yemen|Zaire|Zambia|Zimbabwe|Europe|Western\sEurope|North\sAmerica|South\sAmerica|Asia|South\sEast\sAsia|Central\sAsia|The\sCaucasus|Middle\sEast|Far\sEast|Scandinavia|Africa|North\sAfrica|North\sPole|South\sPole|Central\sAmerica|Caribbean)\s/i",$content,$matches, PREG_SET_ORDER);
    
    		if($matches){
    
    			foreach($matches as $match){
    
    				$pat = $match[1];
    
    				$searchtags[$pat] = trim($pat);
    
    				ShowDebugAutoTag("found country ='" . trim($pat) . "'");
    			}
    		}
    
    		unset($match,$matches);
    
    	}
    
    	/**
    	 * Searches the passed in content looking for names (2+ capitalised words) to add to the search tags array
    	 *
    	 * @param string $content
    	 * @param array $searchtags
    	 */
    	protected function MatchNames($content,&$searchtags){
    
    		ShowDebugAutoTag("IN MatchNames = " . $content);
    
    		// look for names of people or important strings of 2+ words that start with capitals e.g Federal Reserve Bank or Barack Hussein Obama
    		// this is not perfect and will not handle Irish type surnames O'Hara etc
    
    		//  preg_match_all("/((\b[A-Z][^A-Z\s\.,;:]+)(\s+[A-Z][^A-Z\s\.,;:]+)+\b)/u",$content,$matches,PREG_SET_ORDER);
    
    		//preg_match_all("/((\b\p{Uppercase_Letter}(?:\p{Lowercase_Letter}|[^\s\.,;:])+)(\s+\p{Uppercase_Letter}(?:\p{Lowercase_Letter}|[^\s\.,;:])+)+\b)/u",$content,$matches,PREG_SET_ORDER);
    
    		preg_match_all("/((\b\p{Lu}(?:\p{Ll}|[^\s\.,;:])+)(\s+\p{Lu}(?:\p{Ll}|[^\s\.,;:])+)+\b)/u",utf8_encode($content),$matches,PREG_SET_ORDER);
    
    		//preg_match_all("/((\b[A-Z][^A-Z\s\.,;:]+)(\s+[A-Z][^A-Z\s\.,;:]+)+\b)/u",utf8_encode($content),$matches,PREG_SET_ORDER);
    
    		ShowDebugAutoTag("well >> ");
    		ShowDebugAutoTag($matches);
    
    		// found some results
    		if($matches){
    
    			foreach($matches as $match){
    
    				ShowDebugAutoTag("found possible name B4 utf8 decode ='" . trim($match[1]) . "'");
    
    				$pat = utf8_decode($match[1]);
    
    				$searchtags[$pat] = trim($pat);
    
    				ShowDebugAutoTag("found possible name ='" . trim($pat) . "'");
    			}
    		}
    
    		unset($match,$matches);
    	}
    
    	/**
    	 * formats strings so they can be used in regular expressions easily by escaping special chars used in pattern matching
    	 *
    	 * @param string $input
    	 * @return string
    	 */
    	protected function FormatRegEx($input){
    
    		$input = preg_replace("@([$^|()*+?.\[\]{}])@","\\\\$1",$input);
    
    		return $input;
    	}
    
    	/**
    	 * check the content to see if the amount of content that is parsable is above the allowed threshold
    	 *
    	 * @param string
    	 * @return boolean
    	 */
    	protected function ValidContent($content){
    
    		ShowDebugAutoTag("IN ValidContent = $content");
    
    		// strip everything not space or uppercase/lowercase letters
    		$content	= $this->StripNonWords($content);
    
    		ShowDebugAutoTag("after non words stripped = $content");
    
    		// count the total number of words
    		$word_count = str_word_count($content);
    
    		ShowDebugAutoTag("word count = $word_count");
    
    		// no words? nothing to analyse
    		if($word_count == 0){
    			return false;
    		}
    
    		// count the number of capitalised words
    		$capital_count = $this->CountCapitals($content);
    
    		ShowDebugAutoTag("capital count = $capital_count");
    
    		if($capital_count > 0){
    			// check percentage - if its set to 0 then we can only skip the content if its all capitals
    			if($this->ignorepercentage > 0){
    				$per = round(($capital_count / $word_count) * 100);
    
    				ShowDebugAutoTag("% of capitals in content is $per is it > than " .  $this->ignorepercentage . "?");
    
    				if($per > $this->ignorepercentage){
    					return false;
    				}
    			}else{
    				if($word_count == $capital_count){
    					return false;
    				}
    			}
    		}
    
    		return true;
    	}
    
    	/**
    	 * Parse post content to discover new tags and then rank matching tags so that only the most appropriate are added to a post
    	 *
    	 * @param object $object
    	 * @return array
    	 */
    	public function AutoTag($object){
    
    		// skip posts with tags already added
    		/*
    		if ( get_the_tags($object->ID) != false) {
    			return false;
    		}
    		*/
    
    		// tags to add to post
    		$addtags = array();
    
    		// stack used for working out which tags to add
    		$tagstack = array();
    
    		// potential tags to add
    		$searchtags = array();
    
    		$article 	=	html_entity_decode($object->post_content);
    
    /*
    		//preg_match_all("@<(strong|h[1-6]|em|a)[^>]*>([\s\S]+?)<\/?(strong|h[1-6]|em|a)>@i",$article,$matches,PREG_SET_ORDER);
    		//preg_match_all("@.*<(?:strong|(?:h[1-6])|em|a)[^>]*>([\s\S]+?)<\/?(?:strong|(?:h[1-6])|em|a)>.*@i",$article,$matches,PREG_SET_ORDER);
    
    		preg_match_all("@[\s\S]+?<(?:a|em|strong)[^>]*>([\s\S]+?)<\/?(?:a|em|strong)>[\s\S]+?@i",$article,$matches,PREG_SET_ORDER);
    
    		//preg_match_all("@<(strong|h[1-6]|em|a)[^>]*>(.|\n)+?<\/?\1>@i",$article,$matches,PREG_SET_ORDER);
    		//<([^> ]+)[^>]*>(.|\n)+?<\/?\1>
    
    		print_r($matches);
    
    		preg_match_all("@[\s\S]*?<h[1-6][^>]*>([\s\S]+?)<\/?h[1-6]>[\s\S]+?@i",$article,$matches,PREG_SET_ORDER);
    
    			//print_r($matches);
    
    		if($matches){
    
    			ShowDebugAutoTag("we got special content");
    
    			foreach($matches as $match){
    				//echo $match . " - ";
    				//print_r($match);
    				ShowDebugAutoTag("match = " . $match[1]);
    				//ShowDebugAutoTag("match = " . $match[1][0]);
    				//print_r($item);
    			}
    		}
    
    		ShowDebugAutoTag("what did we get");
    		die;
    */
    		// ensure all html entities have been decoded
    		$article	= html_entity_decode(strip_tags($object->post_content));
    		$excerpt	= html_entity_decode($object->post_excerpt);
    		$title		= html_entity_decode($object->post_title);
    
    		// no need to trim as empty checks for space
    		if(empty($article) && empty($excerpt) && empty($title)){
    			return $addtags;
    		}
    
    		// if we are looking for new tags then check the major sections to see what percentage of words are capitalised
    		// as that makes it hard to look for important names and strings
    		if($this->autodiscover){
    
    			$discovercontent = "";
    
    			ShowDebugAutoTag("is the title valid for searching?");
    
    			// ensure title is not full of capitals
    			if($this->ValidContent($title)){
    				ShowDebugAutoTag("Title is valid");
    
    				$discovercontent .= " " . $title . ". ";
    			}
    
    			ShowDebugAutoTag("is the content valid for searching?");
    
    			// ensure article is not full of capitals
    			if($this->ValidContent($article)){
    				ShowDebugAutoTag("Article is valid");
    
    				$discovercontent .= " " . $article . " ";
    			}
    
    			ShowDebugAutoTag("is the excerpt valid for searching?");
    
    			// ensure excerpt  is not full of capitals
    			if($this->ValidContent($excerpt)){
    				ShowDebugAutoTag("Excerpt is valid");
    
    				$discovercontent .= " " . $excerpt . " ";
    			}
    
    		}else{
    			$discovercontent	= "";
    		}
    
    		ShowDebugAutoTag("Our discover content is = '" . $discovercontent . "'");
    
    		// if we are doing a special parse of the title we don't need to add it to our content as well
    		if($this->ranktitle){
    			$content			= " " . $article . " " . $excerpt . " ";
    		}else{
    			$content			= " " . $article . " " . $excerpt . " " . $title . " ";
    		}
    
    		// set working variable which will be decreased when tags have been found
    		$maxtags			= $this->maxtags;
    
    		// reformat content to remove plurals and punctuation
    		$content			= $this->FormatContent($content);
    		$discovercontent	= $this->FormatContent($discovercontent);
    
    		ShowDebugAutoTag("the discover content = " . $discovercontent);
    
    		// now if we are looking for new tags and we actually have some valid content to check
    		if($this->autodiscover && !empty($discovercontent)){
    
    			// look for Acronyms in content
    			// the searchtag array is passed by reference to prevent copies of arrays and merges later on
    			$this->MatchAcronyms($discovercontent,$searchtags);		
    
    			// look for countries as these are used as tags quite a lot
    			$this->MatchCountries($discovercontent,$searchtags);
    
    			// look for names and important sentences 2-4 words all capitalised
    			$this->MatchNames($discovercontent,$searchtags);
    		}
    
    		// get existing tags from the DB as we can use these as well as any new ones we just discovered
    		//global $wpdb;
    
    		// just get all the terms from the DB in array format
    
    		$dbterms =  array("conspiracy","Alex Jones","Québec"); //" Patrick Lagacé",
    
    		// if we have got some names and Acronyms then add them to our DB terms
    		// as well as the search terms we found
    		$c = count($searchtags);
    		$d = count($dbterms);
    
    		ShowDebugAutoTag("total search tags = $c and from the DB = $d");
    
    		if($c > 0 && $d > 0){
    
    			// join the db terms to those we found earlier
    			$terms = array_merge($dbterms,$searchtags);
    
    			// remove duplicates which come from discovering new tags that already match existing stored tags
    			$terms = array_unique($terms);
    
    		}elseif($c > 0){
    
    			// just set terms to those we found through autodiscovery
    			$terms = $searchtags;
    
    		}elseif($d > 0){
    
    			// just set terms to db results
    			$terms = $dbterms;
    		}
    
    		ShowDebugAutoTag("our full list of terms to search");
    		ShowDebugAutoTag($terms);
    
    		// clean up
    		unset($searchtags,$dbterms);
    
    		// if we have no terms to search with then quit now
    		if(!isset($terms) || !is_array($terms)){
    			// return empty array
    			return $addtags;
    		}
    
    		// do we rank terms in the title higher?
    		if($this->ranktitle){
    
    			ShowDebugAutoTag("search within title");
    
    			// parse the title with our terms adding tags by reference into the tagstack
    			// as we want to ensure tags in the title are always tagged we tweak the hitcount by adding 1000
    			// in future expand this so we can add other content to search e.g anchors, headers each with their own ranking
    			$this->SearchContent($title,$terms,$tagstack,1000);
    		}
    
    		ShowDebugAutoTag("search within content");
    
    		// now parse the main piece of content
    		$this->SearchContent($content,$terms,$tagstack,0);
    
    		// cleanup
    		unset($terms,$term);
    
    		// take the top X items
    		if($maxtags != -1 && count($tagstack) > $maxtags){
    
    			// sort our results in descending order using our hitcount
    			uasort($tagstack, array($this,'HitCount'));
    
    			// return only the results we need
    			$tagstack = array_slice($tagstack, 0, $maxtags);
    		}
    
    		// add our results to the array we return which will be added to the post
    		foreach($tagstack as $item=>$tag){
    			$addtags[] = $tag['term'];
    		}
    
    		// we don't need to worry about dupes e.g tags added when the rank title check ran and then also added later
    		// as WordPress ensures duplicate taxonomies are not added to the DB
    
    		ShowDebugAutoTag("final array of post tags");
    		ShowDebugAutoTag($addtags);
    
    		// return array of post tags
    		return $addtags;
    
    	}
    
    	/**
    	 * parses content with a supplied array of terms looking for matches
    	 *
    	 * @param string $content
    	 * @param array $terms
    	 * @param array $tagstack
    	 * @param integer $tweak
    	 */
    	protected function SearchContent($content,$terms,&$tagstack,$tweak){
    
    		if(empty($content) || !is_array($terms) || !is_array($tagstack)){
    			return;
    		}
    
    		//$content = preg_replace("/\./"," ",$content);
    
    		$content = $this->RemoveNoiseWords($content);
    
    		// now loop through our content looking for the highest number of matching tags as we don't want to add a tag
    		// just because it appears once as that single word would possibly be irrelevant to the posts context.
    		foreach($terms as $term){
    
    			// safety check in case some BS gets into the DB!
    			if(strlen($term) > 1){
    
    				// for an accurate search use preg_match_all with word boundaries
    				// as substr_count doesn't always return the correct number from tests I did
    
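    				// escape the "/" delimiter too and use the u modifier so that \b treats
    				// UTF-8 letters such as é as word characters
    				$regex = "/\b" . preg_quote( $term, "/" ) . "\b/u";
    
    				ShowDebugAutoTag("regex to search with = " . $regex);
    
    				// preg_match_all returns false on error (e.g. invalid UTF-8) which fails the check below
    				$i = preg_match_all($regex,$content,$matches);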
    
    				// if found then store it with the number of occurrences e.g. its hit count
    				if($i > 0){
    
    					ShowDebugAutoTag("found " . $i . " matches of " . $term);
    
    					// if we are tweaking the hitcount e.g for ranking title tags higher
    					if($tweak > 0){
    						$i = $i + $tweak;
    					}
    
    					// do we add all tags whether or not they appear nested inside other matches
    					if($this->nestedtags == AUTOTAG_BOTH){
    
    						ShowDebugAutoTag("ADD BOTH");
    
    						// add term and hit count to our array
    						$tagstack[] = array("term"=>$term,"count"=>$i);
    
    					// must be AUTOTAG_SHORT
    					}else{
    
    						ShowDebugAutoTag("MUST BE SHORT");
    
    						$ignore = false;
    
    						// loop through existing tags checking for nested matches e.g New York appears in New York City
    						foreach($tagstack as $key=>$value){
    
    							$oldterm = $value['term'];
    							$oldcount= $value['count'];
    
    							// check whether our new term is already in one of our old terms
    							if(stripos($oldterm,$term)!==false){
    
    								// we found our term inside a longer one and as we are keeping the shortest version we need to add
    								// the other tag's hit count before deleting it, as if it was a ranked title we want this version to show
    								$i = $i + (int)$oldcount;
    
    								// remove our previously stored tag as we only want the smallest version
    								unset($tagstack[$key]);
    
    							// check whether our old term is in our new one
    							}elseif(stripos($term,$oldterm)!==false){
    
    								// yes it is so keep our short version in the stack and ignore our new term
    								$ignore = true;
    								break;
    							}
    						}
    
    						// cast explicitly as a boolean false concatenates to an empty string
    						ShowDebugAutoTag("ignore = " . ($ignore ? "true" : "false"));
    
    						// do we add our new term
    						if(!$ignore){
    							// add term and hit count to our array
    							$tagstack[] = array("term"=>$term,"count"=>$i);
    						}
    					}
    				}
    			}
    		}
    
    		// the $tagstack was passed by reference so no need to return it
    	}
    
    	/**
    	 * used when sorting tag hit count to compare two array items hitcount
    	 *
    	 * @param array $a
    	 * @param array $b
    	 * @return integer
    	 */
    	protected function HitCount($a, $b) {
    		return $b['count'] - $a['count'];
    	}
    
    }
    
    ShowDebugAutoTag("starting");
    
    class postobj{
    
    	public $post_content = "<p><span>??è Exemple : name to ??? tag - Patrick Lagacé Québec the tag is named Patrick Lagacé - name to tag ?é? èep - Québec the tag is named Québec</span></p>";
    
    	public $post_title = "Patrick Lagacé  says hello";
    
    	public $post_excerpt = "";
    
    }
    ShowDebugAutoTag("start");
    
    // create auto tag object
    $strictlyautotags = new StrictlyAutoTags();
    
    $object = new postobj();
    
    $tags = $strictlyautotags->AutoTag($object);
    
    ShowDebugAutoTag("got tags");
    
    ShowDebugAutoTag($tags);
    ?>
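
For anyone testing the script above with accented terms, the word-boundary check in SearchContent is the part to watch. Below is a minimal sketch, assuming the content is valid UTF-8 and that PHP's PCRE was built with Unicode support. `count_term_matches` is a hypothetical helper for illustration only and is not part of the plugin. It uses explicit Unicode letter/digit lookarounds instead of `\b`, so a term ending in é is still matched as a whole word.

```php
<?php
// Hypothetical helper (not part of the plugin): counts whole-word occurrences
// of $term in $content, treating any Unicode letter or digit as a word character.
function count_term_matches($term, $content) {
    // escape the term, including the "/" delimiter used below
    $quoted = preg_quote($term, "/");

    // a letter, digit or underscore on either side means we are inside a longer word
    $regex = "/(?<![\\p{L}\\p{N}_])" . $quoted . "(?![\\p{L}\\p{N}_])/u";

    $count = preg_match_all($regex, $content, $matches);

    // preg_match_all returns false on error, e.g. when the input is not valid UTF-8
    return ($count === false) ? 0 : $count;
}

$content = "Patrick Lagacé Québec the tag is named Patrick Lagacé - Québec the tag is named Québec";

echo count_term_matches("Québec", $content), "\n";         // 3
echo count_term_matches("Patrick Lagacé", $content), "\n"; // 2
echo count_term_matches("Lagac", $content), "\n";          // 0, as "Lagac" is only part of "Lagacé"
```

Without the `u` modifier the regex engine works byte by byte, so the two bytes of é are not seen as a letter and a term such as "Lagac" can be matched inside "Lagacé".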
    Thread Starter nosaint

    (@nosaint)

    Yes, you said so, but understanding it would require PHP knowledge, which I don’t have.

    By the way, is there a way to have the plugin treat the special characters as normal characters? I mean, to treat ă, î, â, ș, ț as a, i, a, s, t? For me this would be OK even if it changes the words a little bit…

    Plugin Author strictly-software

    (@strictly-software)

    Without debugging WordPress’s own code there is no way to find out what the issue is.

    If the articles contain only English characters, e.g. only A-Z or a-z, then it should be fine, so just make sure you replace those letters with whatever you feel corresponds to them. Remember that just because ă looks a bit like a doesn’t mean that it’s logically equivalent, and you could create nonsensical terms by doing that.
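
The replacement described above could be done before the text is scanned for tags. Here is a minimal sketch, assuming the content is UTF-8 and that the letters in question are the Romanian diacritics. `strip_romanian_diacritics` and its mapping table are hypothetical and not part of the plugin, so adjust the table to your own language.

```php
<?php
// Hypothetical helper (not part of the plugin): replaces Romanian diacritics
// with the ASCII letters they most resemble. The mapping is an assumption.
function strip_romanian_diacritics($text) {
    $map = array(
        "ă" => "a", "â" => "a", "î" => "i",
        "ș" => "s", "ş" => "s", // comma-below and cedilla forms
        "ț" => "t", "ţ" => "t",
        "Ă" => "A", "Â" => "A", "Î" => "I",
        "Ș" => "S", "Ş" => "S",
        "Ț" => "T", "Ţ" => "T",
    );

    // strtr with an array replaces every key with its value in a single pass
    return strtr($text, $map);
}

echo strip_romanian_diacritics("Ștefan cel Mare și Sfânt"), "\n"; // Stefan cel Mare si Sfant
```

Because the mapping is lossy, two different words can collapse into the same ASCII spelling, which is the risk of nonsensical terms mentioned above.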

    I don’t understand something about UTF-8 characters. My WordPress character set is UTF-8 but my blog’s language is English. Should I change the charset parameter in the blog header to iso-8859-1/windows-1252?

    I chose UTF-8, among other reasons, because the W3C instructions say:
    We recommend the use of UTF-8 wherever possible. (bold and red text on top)

  • The topic ‘[Plugin: Strictly Auto Tags] Striclty Auto Tag – special characters problem’ is closed to new replies.