<html>
<head>
	<title>Proposed Changes to clojure.contrib.str-utils</title>
	<style>
	.code-block{
		background-color:#FFFFCC;
		border-color:#CCCCCC;
		border-width:2px;
		border-style:solid;
		width:35em;
		margin-left:auto;
		margin-right:auto;	
		margin-top:10px;
		margin-bottom:10px;
		padding:5px;
	}
	body{
		width:40em;
		margin-left:auto;		
		margin-right:auto;		
		background-color:#3333FF;
		margin-top:0px;
		margin-bottom:0px;
	}
	#page-content{
		border-color:#990000;
		border-width:0px 2px 0px 2px;
		border-style:solid;
		background-color:#FFFFFF;
		padding:5px;
		margin:0px;
	}
	.ns{
		color:green;
		font-weight:bold;
	}
	</style>
</head>
<body>
	<div id="page-content">
		<h1 style="text-align:center">My Proposed Changes to str-utils</h1>
		<h4 style="text-align:center">Sean Devlin</h4>
		<h4 style="text-align:center">May 13, 2009</h4>
		<p>
			I've been reviewing the str-utils package, and I'd like to propose a few changes to the library.  I've included the code at the bottom.
		</p>

		<h2>Use Multi-Methods</h2>
		<p>
			I'd like to propose rewriting the following methods to use multimethods.  Every single method will take an input called input-string,
			and a variable number of further inputs called remaining-inputs.  The dispatch function will decide what to do based on the remaining inputs.
			Specifically, I've used
		</p>
		<div class="code-block">		
			<code>
				(class (first remaining-inputs))<br>
			</code>
		</div>
		<p>
			repeatedly.  The two most interesting classes are <code class="ns">java.util.regex.Pattern</code> and <code class="ns">clojure.lang.PersistentList</code>.  I deliberately decided <b>not</b> to
			use sequences, because I believed order was important.  One method also takes a map as an input, so that a set of options can be passed as an options
			hash.
		</p>
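		<p>
			The dispatch described above can be sketched roughly as follows.  This is only an illustration under stated assumptions: the name <code>split-sketch</code> is hypothetical and is not the proposed code, and unlike the proposed methods it splits eagerly through <code>java.util.regex.Pattern</code>'s <code>split</code> method instead of returning a lazy sequence.
		</p>
		<div class="code-block">
<pre><code>;; Hypothetical sketch, not the proposed source.
;; Dispatch on the class of the first remaining input.
(defmulti split-sketch
  (fn [input-string &amp; remaining-inputs]
    (class (first remaining-inputs))))

;; A regex does the actual splitting (eagerly here).
(defmethod split-sketch java.util.regex.Pattern
  [input-string pattern]
  (seq (.split pattern input-string)))

;; A list applies each later pattern one level deeper.
(defmethod split-sketch clojure.lang.PersistentList
  [input-string patterns]
  (let [pieces (split-sketch input-string
                             (first patterns))
        more   (rest patterns)]
    (if (empty? more)
      pieces
      (map #(split-sketch % more) pieces))))

(split-sketch "1 2 3\n4 5 6" (list #"\n" #"\s+"))
;; => (("1" "2" "3") ("4" "5" "6"))</code></pre>
		</div>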

		<h2><code>re-partition[input-string &amp; remaining-inputs](...)</code></h2>
		<div class="function-description">
			This method behaves like the original re-partition method, except that the remaining-inputs can be a list or a pattern.  It returns a lazy sequence, and
			is used as a basis for several other methods.
		</div>

		<h2><code>re-split[input-string & remaining-inputs](...)</code></h2>
		<div class="function-description">
			The remaining inputs can be dispatched based on a regex pattern, a list of patterns, or a map.  The regex method is the basis, and does the actual work.<br>

			<h4>Regex</h4>
			This method splits a string into a list based on a regex.  It depends on the re-partition method, and returns a lazy sequence.<br>

			<div class="code-block">
				<code>
					(re-split "1 2 3\n4 5 6" #"\n") => ("1 2 3" "4 5 6")<br>
				</code>
			</div>

			<h4>Map</h4>
			This splits each element based on the inputs of the map.  It is how options are passed to the method.<br>

			<div class="code-block">
				<code>
					(re-split "1 2 3" {:pattern #"\s+" :offset 1}) => ("2" "3")<br>
					(re-split "1 2 3" {:pattern #"\s+" :length 2}) => ("1" "2")<br>
					(re-split "1 2 3" {:pattern #"\s+" :marshal-fn #(java.lang.Double/parseDouble %)}) => (1.0 2.0 3.0)<br>
				</code>
			</div>

			The <code>:pattern</code>, <code>:offset</code>, and <code>:length</code> options are relatively straightforward.  The <code>:marshal-fn</code> is mapped over the pieces after the string is split.<br>
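			One way these options could fit together is sketched below.  The helper name <code>split-with-options</code> is hypothetical, and the sketch assumes that the offset is applied before the length and that the pieces stay strings unless a <code>:marshal-fn</code> is supplied; the proposed code may differ.<br>
			<div class="code-block">
<pre><code>;; Hypothetical helper; option names follow the text.
(defn split-with-options
  [input-string
   {:keys [pattern offset length marshal-fn]}]
  (let [pieces (seq (.split pattern input-string))
        pieces (if offset (drop offset pieces) pieces)
        pieces (if length (take length pieces) pieces)]
    ;; :marshal-fn, if any, runs last on each piece.
    (if marshal-fn (map marshal-fn pieces) pieces)))

(split-with-options
  "1 2 3"
  {:pattern #"\s+"
   :offset 1
   :marshal-fn #(Double/parseDouble %)})
;; => (2.0 3.0)</code></pre>
			</div>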

			<h4>List</h4>
			Each item in the list is either an options map or a regex.  The first item splits the input string, and the remaining items are applied recursively to each resulting piece with <code>map</code>.<br>

			<div class="code-block">
				<code>
					(re-split "1 2 3\n4 5 6" (list #"\n" #"\s+")) => (("1" "2" "3") ("4" "5" "6"))<br>
				</code>
			</div>

			<h4>Chaining</h4>
			These items can be chained together, as the following example shows:<br>
			<div class="code-block">
				<code>
					(re-split "1 2 3\n4 5 6" <br>
					(list #"\n" {:pattern #"\s+" <br>
					:length 2 <br>
					:marshal-fn #(java.lang.Double/parseDouble %)}))<br>
					=> ((1.0 2.0) (4.0 5.0))<br>
				</code>
			</div>
			In my opinion, the <code>:marshal-fn</code> is best used at the end of the list.  It could be used earlier in the list, but an exception will most likely be thrown, because the later items expect strings.
		</div>

		<h2><code>re-gsub[input-string & remaining-inputs](...)</code></h2>
		<div class="function-description">

			This method can take either two atoms (a regex and a replacement string) or a list of such pairs as the remaining inputs.<br>

			Two atoms<br>
			<div class="code-block">
				<code>
					(re-gsub "1 2 3 4 5 6" #"\s" "") => "123456"<br>
				</code>
			</div>
			A paired list<br>
			<div class="code-block">
				<code>
					(re-gsub "1 2 3 4 5 6" '((#"\s" "") (#"\d" "D"))) => "DDDDDD"<br>
				</code>
			</div>
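			The paired-list form can be read as a left-to-right fold over the pairs, where each result feeds the next replacement.  The sketch below is only an illustration: <code>gsub-pairs</code> is a hypothetical name, and it relies on <code>clojure.string/replace</code>, which ships with Clojure 1.2 and later and is not what the proposed code uses.<br>
			<div class="code-block">
<pre><code>(require '[clojure.string :as string])

;; Hypothetical sketch: apply each
;; (pattern replacement) pair in order.
(defn gsub-pairs
  [input-string pairs]
  (reduce (fn [s [pattern replacement]]
            (string/replace s pattern replacement))
          input-string
          pairs))

(gsub-pairs "1 2 3 4 5 6"
            '((#"\s" "") (#"\d" "D")))
;; => "DDDDDD"</code></pre>
			</div>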

		</div>

		<h2><code>re-sub[input-string & remaining-inputs](...)</code></h2>
		<div class="function-description">

			Again, this method can take a list or two atoms as the remaining inputs.<br>

			Two atoms<br>
			<div class="code-block">
				<code>
					(re-sub "1 2 3 4 5 6" #"\d" "D") => "D 2 3 4 5 6"<br>
				</code>
			</div>

			A paired list<br>
			<div class="code-block">
				<code>
					(re-sub "1 2 3 4 5 6" '((#"\d" "D") (#"\d" "E"))) => "D E 3 4 5 6"<br>
				</code>
			</div>

		</div>

		<h2>The <code>nearby</code> Function</h2>
		<p>
			The nearby function is designed to assist with a spell checker, inspired by the example from Peter Norvig.
			
			<h3>Signatures</h3>
			<code>
				<ul>
					<li><b>nearby</b> input-string</li>
					<li><b>nearby</b> input-string seq</li>
				</ul>
			</code>
			
			Here's an example.
			
			<div class="code-block">
				<code>					
					(nearby "cat" (seq "abc")) => <br>("act" "atc"<br>
						"acat" "aat" "caat" "cat" "caat" "caa" "cata"<br>
						"bcat" "bat" "cbat" "cbt" "cabt" "cab" "catb"<br>
						"ccat" "cat" "ccat" "cct" "cact" "cac" "catc")
				</code>
			</div>
			
			The resulting sequence is lazy.  In order to use it in a spellchecker, try using it like this:
			
			<div class="code-block">
				<code>					
					(set (take <i>number</i> (nearby "cat" (seq "abc"))))
				</code>
			</div>
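			
			To make the spellchecker idea concrete, the sketch below filters the first few candidates against a dictionary.  Both the dictionary and the helper name <code>known-words</code> are hypothetical; the candidate vector is simply the first ten entries of the example output above, so the sketch does not depend on <code>nearby</code> itself.
			
			<div class="code-block">
<pre><code>;; Hypothetical dictionary for illustration only.
(def dictionary #{"act" "bat" "cab" "cat"})

;; First ten candidates from the example output.
(def candidates
  ["act" "atc" "acat" "aat" "caat"
   "cat" "caat" "caa" "cata" "bcat"])

;; Keep only the candidates that are real words.
;; A set works as a predicate for filter.
(defn known-words
  [dictionary candidates limit]
  (set (filter dictionary (take limit candidates))))

(known-words dictionary candidates 10)
;; => a set containing "act" and "cat"</code></pre>
			</div>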
			
			If the function is called with only one argument, it behaves like this.

			<div class="code-block">
				<code>					
					(nearby "cat") =><br> (nearby "cat" (cons "" "etaoinshrdlcumwfgypbvkjxqz"))
				</code>
			</div>
			
			The strange order was chosen because it is the English alphabet sorted by letter frequency.  This way the earliest entries will have the
			highest chance of being a valid word.
			
		</p>

		<h2>String Seq Utils</h2>
		<div class="function-description">
			The contrib version of str-utils contains the <code>str-join</code> function.  This is a string-specific version of the more general <code>interpose</code> function.  It inspired the creation of 
			four other functions: <code>str-take</code>, <code>str-drop</code>, <code>str-rest</code>, and <code>str-reverse</code>.  They mimic the behavior of the regular sequence operations, with the exception that they return strings instead of
			a sequence.  Also, some of them can alternatively take a regex as an input.<br>

			<h2><code>str-take</code></h2>
			<p>
				This function is designed to be similar to the <code>take</code> function from the core.  It specifically applies the <code>str</code> function to the resulting sequence.  Also, it can take a regex
				instead of an integer, and will take everything before the regex.  Be careful not to combine a regex and a sequence, as this will cause an error.  Finally, an optional <code>:include</code> parameter
				can be passed to include the matched regex.
			</p>
			<div class="code-block">	
				<code>
					<table>
						<tr>
							<td>(str-take 7 "Clojure Is Awesome")</td>
							<td>=></td>
							<td>"Clojure"</td>
						</tr>
						<tr>
							<td>(str-take 2 ["Clojure" "Is" "Awesome"])</td>
							<td>=></td>
							<td>"ClojureIs"</td>
						</tr>
						<tr>
							<td>(str-take #"\s+" "Clojure Is Awesome")</td>
							<td>=></td>
							<td>"Clojure"</td>
						</tr>
						<tr>
							<td>(str-take #"\s+" "Clojure Is Awesome" <br>&nbsp;&nbsp; {:include true})</td>
							<td>=></td>
							<td>"Clojure "</td>
						</tr>
						<tr>
							<td>(str-take #"\s+" ["Clojure" "Is" "Awesome"])</td>
							<td>=></td>
							<td style="color:red;"><b>error</b></td>
						</tr>
					</table>
				</code>
			</div>
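			<p>
				The integer and regex cases in the table above could be implemented along the following lines.  This is a sketch under assumptions, not the proposed code: <code>str-take-sketch</code> is a hypothetical name, and returning the whole input when the regex does not match is a guess.  Passing a regex together with a vector fails inside <code>re-matcher</code>, which is consistent with the error row.
			</p>
			<div class="code-block">
<pre><code>;; Hypothetical sketch of the integer and regex cases.
;; Returning the whole input when the regex does not
;; match is an assumption of this sketch.
(defn str-take-sketch
  ([n-or-regex input]
   (str-take-sketch n-or-regex input {}))
  ([n-or-regex input {:keys [include]}]
   (if (instance? java.util.regex.Pattern n-or-regex)
     (let [m (re-matcher n-or-regex input)]
       (if (.find m)
         ;; Stop before the match, or after it
         ;; when :include is true.
         (subs input 0
               (if include (.end m) (.start m)))
         input))
     ;; Integer case: take, then glue with str.
     (apply str (take n-or-regex input)))))</code></pre>
			</div>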

			<h2><code>str-drop</code></h2>
			<p>
				This function is designed to be similar to the <code>drop</code> function from the core.  It specifically applies the <code>str</code> function to the resulting sequence.  Also, it can take a regex
				instead of an integer, and will return everything after the first match of the regex.  Be careful not to combine a regex and a sequence, as this will cause an error.  Finally, an optional <code>:include</code> parameter
				can be passed to include the matched text in the result.
			</p>
			
			<div class="code-block">	
				<code>
					<table>
						<tr>
							<td>(str-drop 8 "Clojure Is Awesome")</td>
							<td>=></td>
							<td>"Is Awesome"</td>
						</tr>
						<tr>
							<td>(str-drop 1 ["Clojure" "Is" "Awesome"])</td>
							<td>=></td>
							<td>"IsAwesome"</td>
						</tr>
						<tr>
							<td>(str-drop #"\s+" "Clojure Is Awesome")</td>
							<td>=></td>
							<td>"Is Awesome"</td>
						</tr>
						<tr>
							<td>(str-drop #"\s+" "Clojure Is Awesome" <br>&nbsp;&nbsp; {:include true})</td>
							<td>=></td>
							<td>" Is Awesome"</td>
						</tr>
						<tr>
							<td>(str-drop #"\s+" ["Clojure" "Is" "Awesome"])</td>
							<td>=></td>
							<td style="color:red;"><b>error</b></td>
						</tr>						
					</table>
				</code>
			</div>

			<h2><code>str-rest</code></h2>
			<p>This function applies <code>str</code> to the <code>rest</code> of the input.  It is equivalent to <code>(str-drop 1 <i>input</i>)</code>.</p>
			<div class="code-block">	
				<code>
					<table>
						<tr>
							<td>(str-rest (str :Clojure))</td>
							<td>=></td>
							<td>"Clojure"</td>
						</tr>
					</table>
				</code>
			</div>


			<h2><code>str-reverse</code></h2>
			<div class="brief-function-description">
				This method reverses a string.<br>
				<div class="code-block">
					<code>
						(str-reverse "Clojure") => "erujolC"
					</code>
				</div>
			</div>

			<h3>An Example</h3>
			These methods can be used to help parse strings, such as below.<br>
			<div class="code-block">
				<code>
					(str-take #"&gt;" (str-drop #"&lt; h4" "&lt; h4 ... &gt;")) <br>=> ;the stuff in the middle
				</code>
			</div>

		</div>

		<h2>
			New Inflectors
		</h2>
		I've added a few inflectors that I am familiar with from Rails.  My apologies if their origin is another language.  I'd be interested in knowing where these methods originated.


		<h4>trim</h4>
		<div class="brief-function-description">
			This is a convenience wrapper for the <code>trim</code> method that Java supplies.<br>
			<div class="code-block">
				<code>
					(trim "  Clojure  ") => "Clojure"
				</code>
			</div>
		</div>

		<h4>strip</h4>
		<div class="brief-function-description">
			This is an alias for <code>trim</code>.  I accidentally switch between <code>trim</code> and <code>strip</code> all the time.<br>
			<div class="code-block">
				<code>
					(strip "  Clojure  ") => "Clojure"
				</code>
			</div>
		</div>

		<h4>ltrim</h4>
		<div class="brief-function-description">
			This method removes the leading whitespace<br>
			<div class="code-block">
				<code>
					(ltrim "  Clojure  ") => "Clojure  "
				</code>
			</div>
		</div>

		<h4>rtrim</h4>
		<div class="brief-function-description">
			This method removes the trailing whitespace<br>
			<div class="code-block">
				<code>
					(rtrim "  Clojure  ") => "  Clojure"
				</code>
			</div>
		</div>

		<h4>downcase</h4>
		<div class="brief-function-description">
			This is a convenience wrapper for the <code>toLowerCase</code> method that Java supplies.<br>
			<div class="code-block">
				<code>
					(downcase "Clojure") => "clojure"
				</code>
			</div>
		</div>

		<h4>upcase</h4>
		<div class="brief-function-description">
			This is a convenience wrapper for the <code>toUpperCase</code> method that Java supplies.<br>
			<div class="code-block">
				<code>
					(upcase "Clojure") => "CLOJURE"
				</code>
			</div>
		</div>

		<h4>capitalize</h4>
		<div class="brief-function-description">
			This method capitalizes a string<br>
			<div class="code-block">
				<code>
					(capitalize "clojure") => "Clojure"
				</code>
			</div>
		</div>

		<h4>titleize, camelize, dasherize, underscore</h4>
		<div class="brief-function-description">
			These methods manipulate "sentences", producing a consistent output.  Check the unit tests for more examples.<br>
			<div class="code-block">
				<code>
					<table>
						<tr>
							<td>(titleize "clojure iS Awesome")</td>
							<td>=></td>
							<td>"Clojure Is Awesome"</td>
						</tr>
						<tr>
							<td>(camelize "clojure iS Awesome")</td>
							<td>=></td>
							<td>"clojureIsAwesome"</td>
						</tr>
						<tr>
							<td>(dasherize "clojure iS Awesome")</td>
							<td>=></td>
							<td>"clojure-is-awesome"</td>
						</tr>
						<tr>
							<td>(underscore "clojure iS Awesome")</td>
							<td>=></td>
							<td>"clojure_is_awesome"</td>
						</tr>
					</table>
				</code>
			</div>
		</div>
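		<p>
			As a rough illustration of how these four inflectors relate, each one can be seen as "split into lower-case words, then rejoin".  The sketch below uses hypothetical names and <code>clojure.string</code> (Clojure 1.2 and later); it only treats whitespace, underscores, and dashes as separators, so it would not split camel-cased input, and the proposed code may behave differently.
		</p>
		<div class="code-block">
<pre><code>(require '[clojure.string :as string])

;; Hypothetical helper: lower-case words of a sentence.
(defn words [s]
  (string/split (string/lower-case (string/trim s))
                #"[\s_-]+"))

(defn titleize-sketch [s]
  (string/join " " (map string/capitalize (words s))))

;; First word stays lower-case, the rest are capitalized.
(defn camelize-sketch [s]
  (let [[w &amp; ws] (words s)]
    (apply str w (map string/capitalize ws))))

(defn dasherize-sketch [s]
  (string/join "-" (words s)))

(defn underscore-sketch [s]
  (string/join "_" (words s)))

(camelize-sketch "clojure iS Awesome")
;; => "clojureIsAwesome"</code></pre>
		</div>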
		
		<h4>pluralize</h4>
		<div>
			This is an early attempt at Rails' <code>pluralize</code> function. The code for the <code>pluralize</code>
			function was based on functions contributed by Brian Doyle.
			<div class="code-block">
				<code>
					<table>
						<tr>
							<td>(pluralize "foo")</td>
							<td>=></td>
							<td>"foos"</td>
						</tr>
						<tr>
							<td>(pluralize "beach")</td>
							<td>=></td>
							<td>"beaches"</td>
						</tr>
						<tr>
							<td>(pluralize "baby")</td>
							<td>=></td>
							<td>"babies"</td>
						</tr>
						<tr>
							<td>(pluralize "bus")</td>
							<td>=></td>
							<td>"buses"</td>
						</tr>
					</table>
				</code>
			</div>
		</div>
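		<p>
			The four examples above are covered by three simple suffix rules, sketched below.  This is a hypothetical illustration (<code>pluralize-sketch</code> is not the proposed code, and it is not Brian Doyle's implementation); it ignores irregular nouns entirely.
		</p>
		<div class="code-block">
<pre><code>;; Hypothetical rule-based sketch; no irregular nouns.
(defn pluralize-sketch [word]
  (cond
    ;; consonant + y  ->  -ies
    (re-find #"[^aeiou]y$" word)
    (str (subs word 0 (dec (count word))) "ies")

    ;; sibilant endings  ->  -es
    (re-find #"(s|x|z|ch|sh)$" word)
    (str word "es")

    ;; everything else  ->  -s
    :else
    (str word "s")))

(map pluralize-sketch ["foo" "beach" "baby" "bus"])
;; => ("foos" "beaches" "babies" "buses")</code></pre>
		</div>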

		<h4>singularize</h4>
		<div>
			This is an early attempt at Rails' <code>singularize</code> function. The code for the <code>singularize</code>
			function was based on functions contributed by Brian Doyle.
			<div class="code-block">
				<code>
					<table>
						<tr>
							<td>(singularize "foos")</td>
							<td>=></td>
							<td>"foo"</td>
						</tr>
						<tr>
							<td>(singularize "beaches")</td>
							<td>=></td>
							<td>"beach"</td>
						</tr>
						<tr>
							<td>(singularize "babies")</td>
							<td>=></td>
							<td>"baby"</td>
						</tr>
						<tr>
							<td>(singularize "stops")</td>
							<td>=></td>
							<td>"stop"</td>
						</tr>
					</table>
				</code>
			</div>
		</div>

		<h2>Closing thoughts</h2>
		<p>
			There are three more methods, <code>str-join</code>, <code>chop</code>, and <code>chomp</code>, that were already in str-utils.  I changed the implementation of the methods, but the behavior should be the same.
		</p>
		<p>
			There is a big catch with my proposed change.  The signature of re-split, re-partition, re-gsub and re-sub changes.  They will not be backwards compatible, and will break code.  However, I think the flexibility is worth it.
		</p>
		<h2>TO-DOs</h2>
		There are a few more things I'd like to add, but that could be done at a later date.

		<ul>
			<li>Add more inflectors</li>
		</ul>

		The following additions become pretty easy if the proposed re-gsub is included:
		<ul>
			<li>Add HTML-escape function (like Rails' <code>h</code> method)</li>
			<li>Add Javascript-escape function (like Rails' <code>escape_javascript</code> method)</li>
			<li>Add SQL-escape function</li>
		</ul>
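		<p>
			For instance, an HTML-escape function reduces to a table of (pattern, replacement) pairs applied in order.  The sketch below is hypothetical: it uses <code>clojure.string/replace</code> (Clojure 1.2 and later) in place of the proposed re-gsub, and the names <code>html-escapes</code> and <code>html-escape-sketch</code> are made up for illustration.
		</p>
		<div class="code-block">
<pre><code>(require '[clojure.string :as string])

;; Hypothetical table of (pattern replacement) pairs.
;; "&amp;" must come first so that the entities added
;; by later pairs are not escaped a second time.
(def html-escapes
  [[#"&amp;" "&amp;amp;"]
   [#"&lt;" "&amp;lt;"]
   [#">" "&amp;gt;"]
   [#"\"" "&amp;quot;"]])

(defn html-escape-sketch [s]
  (reduce (fn [acc [pattern replacement]]
            (string/replace acc pattern replacement))
          s
          html-escapes))

(html-escape-sketch "a &lt; b &amp; c")
;; => "a &amp;lt; b &amp;amp; c"</code></pre>
		</div>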

		<p>
			Okay, that's everything I can think of for now.  I'd like to thank Stuart Sierra and all of the contributors to this library.  This is possible because I'm standing on their shoulders.
		</p>
	</div>
</body>
</html>