There are obvious downsides to this approach, such as the potential for false positives (good comments incorrectly classified as spam, perhaps due to the infamous Scunthorpe problem) as well as a high rate of false negatives (spam comments that are not recognized as such and have to be deleted manually). However, word blacklists are a built-in feature of WordPress, so I don’t have to pay for a subscription to a blog spam filtering service such as Akismet. Also, the simplicity and controllability of the approach are nice.

In the rest of this post, I will list and describe all of the string filters I use, so that other bloggers can copy them if so desired.

The single most effective set of blacklisted strings that I use is a short list of common Cyrillic characters. Since this is an English language blog but a great deal of spam is written in Russian (or pseudo-Russian gibberish), this filter is very powerful for its small size. The particular list of characters that I use is taken from an article elsewhere on the Internet which, sadly, I can no longer find. The list is as follows:

д и ж Ч Б Џ Ђ ћ Р° Ѓ

Another common language that I receive spam comments in is Japanese. Almost all Japanese text can be efficiently filtered out with this even shorter list of characters:

。 ー の

Next, we have the medications. This is a very effective filter, but unfortunately the list has to be updated frequently as the distribution of drugs being pushed in the spam I receive changes over time. Also, *cialis* cannot be included, since as Wikipedia notes, it is contained as a substring in the common word *specialist*; nor *ambien*, as it is a substring of *ambient*. The brand name *ultram* is probably safe, however, unless I start posting Warhammer 40K content.
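To see why those two names have to be excluded, here is a toy model of naive substring blacklisting (my own illustration, not WordPress’s actual implementation):

```haskell
import Data.Char (toLower)
import Data.List (isInfixOf)

-- Hypothetical blacklist containing the two problem entries above.
blacklist :: [String]
blacklist = ["cialis", "ambien"]

-- A comment is flagged if any blacklisted string occurs anywhere in it.
isSpam :: String -> Bool
isSpam comment = any (`isInfixOf` map toLower comment) blacklist
```

With this filter, `isSpam "Ask a specialist"` and `isSpam "ambient lighting"` both return `True`: exactly the false positives described above.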

adderall
alprazolam
clomid
clonazepam
clopidogrel
diazepam
doxycycline
effexor
ephedrine
ivermectin
klonopin
lasix
lunesta
oxymorphone
phentermine
restoril
retin a
retin-a
sildenafil
tetracycline
tramadol
ultram
valacyclovir
valium
viagra
vicodin
xanax
zoloft
zolpidem

Next up, we have distinctive phrases that occur in certain fixed spam messages that get posted over and over again. This filter is not very effective in the long run, since the particular spam messages tend to change over time, but as a short-term fix to get rid of individual really persistent spammers, it can work pretty well.

going to put you in the freezer as punishment
hard to find your site in google
hard to find your website in google
I noticed that your On-Page SEO is is missing a few factors
I noticed your site lost rank in google
missing out on at least 300 visitors per day
We have decided to open our POWERFUL and PRIVATE website traffic system

(Yes, the phrase “going to put you in the freezer as punishment” was actually present in a spam comment I received over and over again for a while several years ago. It’s from a joke about a guy putting his pet parrot in the freezer. Look it up if you are really desperate to know.)

Similarly, I also have a small pile of fixed URLs and website names that get spammed over and over again for a period of time. I’m not going to list them here, since including them seems likely to get this site banned from search engine results. Besides, they usually only work as filters for a short period of time before the spammers move on to greener pastures.

On the other end of the spectrum, we have common and widely-used phrases that happen to occur frequently in spam from many different sources while not being likely in legitimate comments relevant to the content on my blog. This is a particularly tricky category, since these phrases could easily occur in genuine comments if the subject matter of my blog strays too far into certain territory. Because of this, I only have two phrases of this sort blacklisted at the moment:

search engine optimization
where to buy

The largest category of filtered strings that I have at the moment is types and brand names of products that the spam purports to offer for sale at cheap prices. This is another category where a level of care is required, because it would be easy to accidentally filter out a legitimate comment that just happens to mention one of these items. Here is the list I am currently using:

air jordan
auto insurance
babyliss
burberry
canada goose
gucci
handbag
hermes
high heel
jerseys
jimmy choo
jordans
lacoste
louboutin
louis vuitton
lululemon
marc jacobs
michael kor
michaelkor
mlb jersey
moncler
nail art
nba jersey
newbalance
nfl jersey
nike
oakley sunglasses
payday loan
prada
ray ban
uggs
wholesale beads

And last but certainly not least, we have the vices. These should be reasonably safe to filter out as long as my blog doesn’t get too, uh, *spicy*.

casino
erotic
porn
sexy

And there you have it. A few simple word filters can catch the majority of the spam comments this blog receives. Not bad for what it is.

I discovered this bug when I wrote some code that compiled in Eclipse, committed it, and then got an email a few minutes later from our Jenkins continuous integration server saying that the build failed. From the error message, I managed to track it down to a specific section of code that compiled in Eclipse but gave a compile error in javac.

This isn’t the first time I’ve run into a Java compiler or standard library bug while developing CertSAFE, nor is it the first time that I’ve submitted a bug report via the Oracle web form. However, it is the first time that I’ve had a report accepted and published as a verified OpenJDK bug.

I’m always happy when I find a compiler bug, because it makes me feel better about bugs in my code to know that the platform developers screw up too.

```haskell
data MergeableSet = ...

type Elem = Int

empty :: (Elem, Elem) -> MergeableSet
singleton :: (Elem, Elem) -> Elem -> MergeableSet
size :: MergeableSet -> Int
toList :: MergeableSet -> [Elem]
union :: MergeableSet -> MergeableSet -> MergeableSet
```

Seems fairly reasonable, right? I’m going to show that **it is likely that no such data structure exists**.

First, note that some very similar data structures do in fact exist. Haskell’s Data.Set can be used to implement this interface with \(O(1)\) `singleton` and `size`, \(O(\log(n))\) membership testing (which is obviously much more powerful than `toList`), and \(O(n)\) `union`. Brodal, Makris, and Tsichlas (2006) presented a purely functional data structure that has \(O(1)\) `singleton`, \(O(\log(n))\) membership testing, and \(O(1)\) “`join`”, which is the same as `union` but requires every element in the first set to be strictly less than every element in the second set.

So why is the variant above so implausible?

If a `MergeableSet` data structure with the given time bounds exists (even without the `size` operation), then **it is possible to find the transitive closure of an \(n\)-vertex graph in near-optimal time \(O(n^2 \log(n)^c)\)**.

The algorithms for computing the transitive closure of a graph with the current best known worst-case runtime are based on algorithms for fast matrix multiplication. In particular, the transitive closure of an \(n\)-vertex graph can be computed in time \(O(n^\omega)\), where \(\omega < 2.373\) is the best known exponent for matrix multiplication. A faster algorithm for transitive closure would actually give a faster algorithm for Boolean matrix multiplication as well, as noted by Fischer and Meyer (1971).

Now, the problem of finding the transitive closure of a general graph can be reduced to the problem of finding the transitive closure of a directed acyclic graph. We can just compute the strongly connected components of the graph using any of the several linear-time algorithms, then compute the transitive closure of the resulting kernel DAG. Looping over the pairs of vertices in the original graph to map back to the starting domain takes \(O(n^2)\) time, but since the size of the output is \(n^2\) bits anyway, there’s no additional asymptotic cost.
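As a sketch of that condensation step (my own illustration; the function name `kernelDagEdges` is hypothetical, not from any library), `Data.Graph.stronglyConnComp` does the heavy lifting:

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)
import qualified Data.Map as Map
import qualified Data.Set as Set

-- Collapse each strongly connected component of a graph, given as an
-- adjacency list, into a single vertex, and return the edge list of the
-- resulting kernel DAG (SCCs numbered 0, 1, ...). Assumes every vertex
-- appears as a key in the adjacency list.
kernelDagEdges :: Ord v => [(v, [v])] -> [(Int, Int)]
kernelDagEdges adj = Set.toList $ Set.fromList
    [ (compOf Map.! u, compOf Map.! v)
    | (u, vs) <- adj, v <- vs
    , compOf Map.! u /= compOf Map.! v ]
  where
    sccs = stronglyConnComp [ (v, v, vs) | (v, vs) <- adj ]
    compOf = Map.fromList
      [ (v, i) | (i, comp) <- zip [0 ..] sccs, v <- flattenSCC comp ]
```

For example, the graph `[(1,[2]),(2,[1,3]),(3,[])]` has SCCs \(\{1,2\}\) and \(\{3\}\), so its kernel DAG has a single edge between the two components.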

Suppose then that `MergeableSet` exists and we want to find the transitive closure of a DAG. We can associate to each vertex the set of vertices reachable from that vertex, stored as a `MergeableSet`. By traversing the graph in reverse topological order and using `union` to combine the sets of all of the vertices adjacent to each vertex, we can compute `MergeableSet`s of reachable vertices for all vertices in \(O(n^2 \log(n)^d)\) time. Then we just loop over all \(n\) vertices and obtain their lists of reachable vertices using `toList`, which also takes \(O(n^2 \log(n)^d)\) time. A Haskell implementation of this idea (adding a slight \(O(\log(n))\) overhead by using `Data.Map` so that I don’t have to get mutable arrays involved) looks like this:

```haskell
import qualified Data.Array as Array
import Data.Graph
import qualified Data.Map as Map

dagTransitiveClosure :: Graph -> Graph
dagTransitiveClosure g = buildG (Array.bounds g) transitiveClosureEdges
  where
    rs = reachableSets g
    transitiveClosureEdges =
      [(v1, v2) | v1 <- vertices g, v2 <- toList (rs Map.! v1), v1 /= v2]

type ReachableSets = Map.Map Vertex MergeableSet

reachableSets :: Graph -> ReachableSets
reachableSets g = foldl addVertex Map.empty $ topSort $ transposeG g
  where
    addVertex :: ReachableSets -> Vertex -> ReachableSets
    addVertex rs v = Map.insert v reachableSet rs
      where
        reachableSet =
          foldl union (singleton (Array.bounds g) v) $
          map (rs Map.!) $ g Array.! v
```

If a `MergeableSet` data structure with the given time bounds exists (even without the `toList` operation), then **Cnf-Sat, the Boolean satisfiability problem for formulas in conjunctive normal form, has a \(2^{\delta n} \cdot \text{poly}(m)\) algorithm for some \(\delta < 1\)**.

Pătrașcu and Williams (2010) gave several hypotheses under which Cnf-Sat would have substantially faster algorithms than brute-force search. One of their theorems is as follows: if a certain problem 2Sat+2Clauses can be solved in time \(O((n + m)^{2 - \epsilon})\) for any \(\epsilon > 0\), then Cnf-Sat with \(n\) variables and \(m\) clauses can be solved in time \(2^{\delta n} \cdot \text{poly}(m)\) for some \(\delta < 1\). They note in passing that 2Sat+2Clauses reduces in linear time to the following problem:

Given a directed graph \(G = (V, E)\) and subsets \(S, T \subseteq V\), determine if there is some \(s \in S\) and \(t \in T\) with no path from \(s\) to \(t\).

By computing the strongly-connected components of \(G\), we can again without loss of generality assume that \(G\) is acyclic.

Now suppose that `MergeableSet` exists. Then it is possible to solve this problem in time \(O((n + m) \cdot \log(n)^c)\) for a graph with \(n\) vertices and \(m\) edges. First, we compute the set of vertices in \(T\) reachable from each vertex, using essentially the same algorithm as the one for transitive closure from before. Then we loop over each vertex in \(S\) and use `size` to test whether the size of its reachable set is less than \(|T|\). If we find a vertex \(s\) where this is the case, then we return true; otherwise, we return false. (We can also find a specific vertex \(t\) with no path from \(s\) to \(t\) by depth-first search from \(s\).)

So, to summarize, `MergeableSet` would dramatically improve upon the known upper bounds for graph reachability problems. It’s probably too good to be true.

**Source code and documentation for rulesgen are available on GitHub**.

```haskell
{-# LANGUAGE GADTs, RankNTypes #-}
module Data.Foldable.Mono ((*$*)) where

import Data.MonoTraversable (Element, MonoFoldable(..))
-- ^ from the mono-traversable package

(*$*) :: MonoFoldable mono
      => (forall t. Foldable t => t (Element mono) -> a) -> mono -> a
f *$* o = f (Foldabilized o)

data Foldabilized a where
  Foldabilized :: MonoFoldable mono => mono -> Foldabilized (Element mono)

instance Foldable Foldabilized where
  foldr f z (Foldabilized o) = ofoldr f z o
  -- (Similar implementations for the other methods can be included
  -- here for efficiency.)
```

And then use it like this:

```haskell
import Data.Foldable.Mono
import qualified Data.Text.Lazy as T

testText = T.pack "foo quux bar"

example1 = maximum *$* testText
-- ^ equals 'x'

example2 = mapM_ print *$* testText
-- ^ prints "'f'\n'o'\n'o'\n..."
```

Notice that those are the polymorphic `Foldable` functions `maximum` and `mapM_`, not `Text`-specific functions. I don’t know if this has any real-world applications, but it’s kind of neat…

**Update:** As pointed out by lfairy on Reddit, the FMList type works kind of like this.

Capsules are nice because they can form both spherical and elongated shapes in any direction. The animation above shows how Super Smash Bros. Melee uses spherical hitboxes that are “stretched” across frames into capsules to prevent fast-moving attacks from going through opponents without hitting them. (Marvel vs. Capcom 3 uses the same trick.) What really makes capsules useful is that they have a very simple mathematical description: a capsule is the set of all points less than a certain radius from a line segment. This means you can check whether two capsules intersect each other by just finding the shortest distance between the two line segments and checking whether it is less than the sum of the radii.

Calculating the distance between two line segments is a well-known problem. This StackOverflow answer gives the code to do that with floating-point arithmetic. Sometimes, though, approximating the correct answer with floating-point isn’t good enough. What if we want an exact intersection test for capsules using only integer arithmetic?

I’ll be giving code examples in Haskell. The code will be for 2-D capsules, but the 3-D case is not too different. Let’s start with some basic definitions using the vector-space package. Since we’re using integer arithmetic, all of our vectors and radii should have integer values only.

```haskell
{-# LANGUAGE TypeFamilies #-}
import Data.VectorSpace

-- Use arbitrary-size integers to avoid overflow in later calculations.
-- If you are using very small values only, this may not be necessary.
type GeomInt = Integer

data Vec = Vec { vecX, vecY :: !GeomInt } deriving (Show)

instance AdditiveGroup Vec where
  zeroV = Vec 0 0
  Vec x1 y1 ^+^ Vec x2 y2 = Vec (x1 + x2) (y1 + y2)
  negateV (Vec x y) = Vec (-x) (-y)

instance VectorSpace Vec where
  type Scalar Vec = GeomInt
  s *^ Vec x y = Vec (s * x) (s * y)

instance InnerSpace Vec where
  Vec x1 y1 <.> Vec x2 y2 = x1 * x2 + y1 * y2

-- Represents a *closed* 2-D line segment. Zero-length segments are allowed.
data Segment = Segment { segmentEnd1, segmentEnd2 :: !Vec }
  deriving (Show)

-- Represents an *open* 2-D stadium (disk-capped rectangle). It is required
-- that capsuleRadius > 0.
data Capsule = Capsule { capsuleSegment :: !Segment, capsuleRadius :: !GeomInt }
  deriving (Show)
```

The first step of the line segment distance computation is to test whether the line segments intersect. The test shown in the StackOverflow answer doesn’t work for our purposes because it uses floating-point division and because it treats parallel segments as never intersecting. Instead, we can use the exact test from this page.

```haskell
segmentsIntersect :: Segment -> Segment -> Bool
segmentsIntersect (Segment p1 q1) (Segment p2 q2) =
    (o1 /= o2 && o3 /= o4)
    || (o1 == Collinear && onSegment p1 p2 q1)
    || (o2 == Collinear && onSegment p1 q2 q1)
    || (o3 == Collinear && onSegment p2 p1 q2)
    || (o4 == Collinear && onSegment p2 q1 q2)
  where
    o1 = orientation p1 q1 p2
    o2 = orientation p1 q1 q2
    o3 = orientation p2 q2 p1
    o4 = orientation p2 q2 q1

data Orientation = Collinear | Clockwise | Counterclockwise
  deriving (Show, Eq)

orientation :: Vec -> Vec -> Vec -> Orientation
orientation (Vec px py) (Vec qx qy) (Vec rx ry) =
    case compare val 0 of
      LT -> Counterclockwise
      EQ -> Collinear
      GT -> Clockwise
  where
    val = (qy - py) * (rx - qx) - (qx - px) * (ry - qy)

-- onSegment p q r checks if q lies on the segment pr, assuming that
-- p, q, and r are collinear.
onSegment :: Vec -> Vec -> Vec -> Bool
onSegment (Vec px py) (Vec qx qy) (Vec rx ry) =
    qx <= max px rx && qx >= min px rx && qy <= max py ry && qy >= min py ry
```
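The `orientation` test is just a sign-of-cross-product check. Here is a standalone version on plain integer tuples (my own illustrative rewrite, not part of the code above), which makes it easy to spot-check by hand:

```haskell
-- Sign of the cross product (q - p) x (r - q): LT corresponds to a
-- counterclockwise (left) turn, EQ to collinear points, GT to a
-- clockwise (right) turn, matching the Orientation type above.
orient :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer)
       -> Ordering
orient (px, py) (qx, qy) (rx, ry) =
    compare ((qy - py) * (rx - qx) - (qx - px) * (ry - qy)) 0
```

For example, `orient (0,0) (1,0) (0,1)` is `LT` (a left turn), `orient (0,0) (1,1) (2,2)` is `EQ` (collinear), and `orient (0,0) (1,0) (2,-1)` is `GT` (a right turn).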

Now here’s the tricky bit. If the segments do not intersect, we can’t simply find the distance between them to check against the radii, because the shortest distance may not be an integer. The standard trick of doing all comparisons on squared distance values to avoid square root operations doesn’t completely solve the problem either, because the closest point on one segment to the other may not even have integer coordinates.
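A concrete example of that last point (my own, not from any particular reference): project the point \((1,1)\) onto the segment from \((0,0)\) to \((2,1)\). The projection parameter is \(t = \frac{(1,1)\cdot(2,1)}{|(2,1)|^2} = \frac{3}{5}\), so the closest point is \((6/5, 3/5)\): rational, not integral, even though every input coordinate is an integer. Checking with exact rational arithmetic:

```haskell
import Data.Ratio ((%))

-- Projection parameter of p = (1,1) onto the segment (0,0)-(2,1):
-- t = ((p - e1) <.> d) / (d <.> d), with all inputs integers.
t :: Rational
t = (1 * 2 + 1 * 1) % (2 * 2 + 1 * 1)

-- The closest point e1 + t *^ d does not have integer coordinates.
closest :: (Rational, Rational)
closest = (t * 2, t * 1)
```

Here `t` is `3 % 5` and `closest` is `(6 % 5, 3 % 5)`.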

If we were using imprecise floating-point arithmetic, the test would look like this:

```haskell
capsulesIntersect :: Capsule -> Capsule -> Bool
capsulesIntersect (Capsule s1@(Segment p1 q1) r1) (Capsule s2@(Segment p2 q2) r2) =
    segmentsIntersect s1 s2
    || check p1 s2 || check q1 s2 || check p2 s1 || check q2 s1
  where
    thresholdSq = (r1 + r2)^2
    check :: Vec -> Segment -> Bool
    check p (Segment e1 e2)
      | segLenSq == 0 || t <= 0 = magnitudeSq (p ^-^ e1) < thresholdSq
      | t >= 1                  = magnitudeSq (p ^-^ e2) < thresholdSq
      | otherwise               = magnitudeSq (p ^-^ near) < thresholdSq
      where
        d = e2 ^-^ e1
        segLenSq = magnitudeSq d
        near = e1 ^+^ t *^ d
        t = ((p ^-^ e1) <.> d) / segLenSq
```

Since we’re using integer arithmetic, though, the `(/)` operator is banned. The trick to pulling this off with integers only is to scale both sides of the third inequality by `segLenSq^2`. This cancels the denominator so that we don’t have to do any division. We can use primed variables to denote “multiplied by a factor of `segLenSq`”. The exact capsule intersection test is then:

```haskell
capsulesIntersect :: Capsule -> Capsule -> Bool
capsulesIntersect (Capsule s1@(Segment p1 q1) r1) (Capsule s2@(Segment p2 q2) r2) =
    segmentsIntersect s1 s2
    || check p1 s2 || check q1 s2 || check p2 s1 || check q2 s1
  where
    thresholdSq = (r1 + r2)^2
    check :: Vec -> Segment -> Bool
    check p (Segment e1 e2)
      | t' <= 0        = magnitudeSq (p ^-^ e1) < thresholdSq
      | t' >= segLenSq = magnitudeSq (p ^-^ e2) < thresholdSq
      | otherwise      = magnitudeSq (p' ^-^ near') < thresholdSq''
      where
        d = e2 ^-^ e1
        segLenSq = magnitudeSq d
        thresholdSq'' = segLenSq^2 * thresholdSq
        p' = segLenSq *^ p
        near' = segLenSq *^ e1 ^+^ t' *^ d
        t' = (p ^-^ e1) <.> d
```

Notice also that we don’t have to check `segLenSq == 0` anymore, because the `t' <= 0` case implicitly covers that.
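As a spot check on the scaling trick (my own worked example, not from the original post): take \(d = (2,1)\), \(e_1 = (0,0)\), \(p = (1,1)\). Then `t' = 3` and `segLenSq = 5`; the exact squared distance from \(p\) to the nearest point \((6/5, 3/5)\) is \(1/5\), while the scaled integer quantity is \(5\), and the two compare against their respective thresholds identically:

```haskell
import Data.Ratio ((%))

-- Exact (rational) squared distance from p = (1,1) to its projection
-- (6/5, 3/5) onto the segment (0,0)-(2,1).
exactDistSq :: Rational
exactDistSq = (1 - 6 % 5)^(2 :: Int) + (1 - 3 % 5)^(2 :: Int)

-- Division-free version: |segLenSq *^ p - near'|^2 with segLenSq = 5,
-- p' = (5,5) and near' = t' *^ d = (6,3).
scaledDistSq :: Integer
scaledDistSq = (5 - 6)^(2 :: Int) + (5 - 3)^(2 :: Int)

-- For any integer thresholdSq T, the exact comparison against T agrees
-- with the scaled comparison against segLenSq^2 * T = 25 * T.
agreesFor :: Integer -> Bool
agreesFor thresholdSq =
    (exactDistSq < fromInteger thresholdSq)
      == (scaledDistSq < 25 * thresholdSq)
```

Here `exactDistSq * 25 == fromInteger scaledDistSq`, which is exactly why the two tests always agree.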

Yay, math!

I took the sample shader code from that page and **translated it into a simple WebGL demo**. You need to have a browser that supports WebGL and the `WEBGL_depth_texture` extension. (Chrome should work, at least.) There are two sliders that let you control the subsurface scattering effect:

- One slider controls the simulated scattering radius by adjusting the distance between samples for the blur operation. If you turn this parameter up very high, you can get wave-like artifacts near sharp transitions in depth due to the way the depth buffer is factored into the blur operation. Increasing the number of Gaussian blur samples would reduce this effect at the cost of performance.
- The other slider controls how sharp a depth difference has to be before the shader will stop blurring across that area. If you turn this up to a large value, disconnected areas of the mesh will start blurring into each other, but if you turn it down too low the scattering effect will disappear completely.

The program’s output looked like this: `\a. DoubleNegElim (\b. a) : forall a. False -> a`. On the right side of the colon, we have our type/theorem: for every proposition \(a\), \(\bot \rightarrow a\). The notation there is about as good as you’re going to get in ASCII text. But then we have the proof, on the left side of the colon. It’s an inscrutable lambda term encoding a natural deduction proof. The type inference engine is filling in the steps in the proof based only on which deductive rules were used, with the result that, even if you know how to read the notation, you really have to stare at it for a while to figure out why the proof *works*. While this admittedly lends an aura of elegance and mystery to the proceedings, I think I’d prefer a proof that is comprehensible by a human without significant head-scratching.

Perhaps this automagical type insertion is the source of the barrier to understanding? If we wrote out the full natural deduction proof with all the intermediate steps explicit, would this be clearer? Well, here’s a sample natural deduction proof in tree format, from this lecture:

OK, that’s slightly easier to *read*, but good luck *writing* that. And the format would be completely unmanageable for a larger proof. How about tableau proofs? Not to be confused with analytic tableaux, which are a different concept, this is the classic multi-column format dreaded by many sufferers of introductory geometry and logic classes. Here’s a (slightly abridged!) sample from ProofWiki in this format:

| Line | Pool | Formula | Rule | Depends upon |
|---|---|---|---|---|
| 1 | 1 | \(p \wedge q\) | Assumption | (None) |
| 2 | 1 | \(\neg(\neg p \vee \neg q)\) | Sequent Introduction | De Morgan’s Laws: Conjunction: Formulation 1: Forward, 1 |
| 3 | | \((p \wedge q) \Longrightarrow (\neg(\neg p \vee \neg q))\) | Rule of Implication | 1 – 2 |
| 4 | 4 | \(\neg(\neg p \vee \neg q)\) | Assumption | (None) |
| 5 | 4 | \(p \wedge q\) | Sequent Introduction | De Morgan’s Laws: Conjunction: Formulation 1: Reverse, 4 |
| 6 | | \((\neg(\neg p \vee \neg q)) \Longrightarrow (p \wedge q)\) | Rule of Implication | 4 – 5 |
| 7 | | \((\neg(\neg p \vee \neg q)) \Longleftrightarrow (p \wedge q)\) | Biconditional Introduction | 3, 6 |

… yeah. All that entire table did was prove that, if we have \(p \wedge q \vdash \neg(\neg p \vee \neg q)\) and \(\neg(\neg p \vee \neg q) \vdash p \wedge q\), then we have \(\vdash p \wedge q \Longleftrightarrow \neg(\neg p \vee \neg q)\). I’m not sure we’re really making progress here.

So, to sum up, existing formalized systems of propositional logic are almost always either horrendously unwieldy or unreadably compact. In a future post, I’ll cover existing formal proof systems for more sophisticated mathematics as well as discussing possible improvements, but I figured a survey of some of the available systems for simple logic and their flaws was a good place to start.

Enumerating proofs until you find one with the desired properties is one of those grand CS traditions that, like calling an \(O(n^{12})\) algorithm “efficient” while dismissing an \(O(1.001^n)\) algorithm as “intractable”, is handy for theoretical purposes but has essentially no bearing on the real world whatsoever. After all, **no one would really write a program that loops through all possible proofs and tests each one**.

Right?

I decided to experimentally test what happens if you actually try to enumerate proofs, starting with the shortest and working up to ever-more-complicated ones. To keep things simple, I’ll restrict the allowed sentences to propositional logic (no first-order predicates or quantifiers). Applying the Curry–Howard correspondence, every natural deduction proof corresponds to a lambda calculus term, with the type of each term being the theorem it proves. So all we have to do is enumerate well-formed lambda calculus expressions, applying Hindley–Milner type inference to determine whether each term is well-typed and, if it is, the most general theorem that it proves.

The lambda calculus by itself only gives you the implicational fragment of intuitionistic propositional logic. To get all of classical propositional logic, you have to add either additional syntax elements or additional axioms. I’ll use the following set of axioms:

\(\text{Unit} : \top\)

\(\text{DoubleNegElim} : \forall p.\ ((p \rightarrow \bot) \rightarrow \bot) \rightarrow p\)

\(\text{Pair} : \forall p, q.\ p \rightarrow q \rightarrow p \wedge q\)

\(\text{Fst} : \forall p, q.\ p \wedge q \rightarrow p\)

\(\text{Snd} : \forall p, q.\ p \wedge q \rightarrow q\)

\(\text{Left} : \forall p, q.\ p \rightarrow p \vee q\)

\(\text{Right} : \forall p, q.\ q \rightarrow p \vee q\)

\(\text{Case} : \forall p, q, r.\ p \vee q \rightarrow (p \rightarrow r) \rightarrow (q \rightarrow r) \rightarrow r\)

(Note that I’m assuming that \(\neg p\) is a synonym for \(p \rightarrow \bot\).)
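As an aside, every axiom in this list except \(\text{DoubleNegElim}\) has computational content in plain Haskell, with \(\top\) as `()`, \(\bot\) as `Void`, \(\wedge\) as pairs, and \(\vee\) as `Either`. This is a sketch of mine to make the Curry–Howard reading concrete; these definitions are not part of the enumerator itself:

```haskell
import Data.Void (Void)

-- Each intuitionistic axiom is an ordinary total function. Only
-- DoubleNegElim, which is what makes the logic classical, has no
-- implementation as a Haskell term.
unit :: ()
unit = ()

pair :: p -> q -> (p, q)
pair x y = (x, y)

fstP :: (p, q) -> p
fstP = fst

sndP :: (p, q) -> q
sndP = snd

left :: p -> Either p q
left = Left

right :: q -> Either p q
right = Right

caseE :: Either p q -> (p -> r) -> (q -> r) -> r
caseE e f g = either f g e

-- Matching the convention below: negation is implication of falsehood.
type Not p = p -> Void
```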

I’ll use Haskell for this, since it’s my favorite programming language and it’s particularly well-suited to this sort of task. It isn’t too hard to write a corecursive function that spits out a list containing each lambda calculus term exactly once (up to alpha-renaming):

```haskell
newtype VarID = VarID String deriving (Eq, Ord)

data Expr = Var VarID | Expr `Ap` Expr | Lambda VarID Expr

allExprs :: [Expr]
allExprs = exprs [] (map VarID goodNames)

exprs :: [VarID] -> [VarID] -> [Expr]
exprs boundVars availableVars@(var:vars) =
    atomicExprs ++ interleave apExprs lambdaExprs
  where
    atomicExprs = map Var $ boundVars ++ Map.keys defaultVarTypes
    apExprs = [(exprs boundVars availableVars !! li)
                 `Ap` (exprs boundVars availableVars !! ri)
              | (li, ri) <- pairingList]
    lambdaExprs = map (Lambda var) $ exprs (var:boundVars) vars
exprs _ [] = error "Shouldn't happen."

goodNames :: [String]
goodNames = [[base] | base <- ['a'..'z']]
            ++ [base : show num | num <- [1..], base <- ['a'..'z']]

pairingList :: [(Int, Int)]
pairingList = [(a, b - a) | b <- [0..], a <- [0..b]]

interleave :: [a] -> [a] -> [a]
interleave (x:xs) (y:ys) = x : y : interleave xs ys
interleave _ _ = error "Shouldn't happen."
```

This code is easier to understand from its results than it is from the code itself:

```
> mapM_ print allExprs
Case
DoubleNegElim
Fst
Left
Pair
Right
Snd
Unit
Case Case
\a. a
Case DoubleNegElim
\a. Case
DoubleNegElim Case
\a. DoubleNegElim
Case Fst
...
```

The position of each expression in this list is a Gödel numbering of the well-formed lambda terms. It loosely corresponds with the “length” of the expression, so you could say that the first proof in this list proving a particular theorem is the “shortest” proof of that theorem, under a particular definition of “shortest”. “Least complex” might be a better description.
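One small piece of the enumerator worth a second look is `pairingList`, which walks successive diagonals of the quarter-plane so that every `(li, ri)` index pair is eventually reached. Reproducing it on its own:

```haskell
-- The same diagonal enumeration used by apExprs above: diagonal b
-- contains exactly the pairs whose components sum to b, so every pair
-- of naturals appears exactly once.
pairingList :: [(Int, Int)]
pairingList = [(a, b - a) | b <- [0 ..], a <- [0 .. b]]
```

For example, `take 6 pairingList` is `[(0,0),(0,1),(1,0),(0,2),(1,1),(2,0)]`.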

One thing we can do with this function is generate every proof up to a certain complexity, then sift through to see if there are any interesting theorems. We might call an “interesting” theorem one with a short description. If there are multiple valid proofs of a theorem, we pick the least-complex one. That’s what this next bit of code does:

```haskell
listTheorems :: Int -> IO ()
listTheorems maxIndex = do
    let coveredExprs = take maxIndex allExprs
        theoremsList = foldl'
          (\best e -> case principalType defaultEnv e of
              Left _ -> best
              Right t
                | isNew (map fst best) t -> (t, e):best
                | otherwise -> best)
          [] coveredExprs
        sortedTheorems = sortBy (comparing (typeLength . fst)) theoremsList
    forM_ sortedTheorems $ \(t, e) ->
      putStrLn $ show e ++ " : " ++ show (prettify t)

isNew :: [Polytype] -> Polytype -> Bool
isNew existing maybeNew =
    not $ any (\pt -> isMoreGeneralThan defaultEnv pt maybeNew) existing
```

Notice how we omit a proof if we’ve already given a simpler proof of a more general result, but include a proof of a more general theorem even if it’s more complex to show than previously-discovered restricted cases.

The complete Haskell source code for this toy is available here. I compiled it with `ghc -O2`, which helps a *lot* speed-wise relative to just running it in GHCi. Running `listTheorems 4000`, here are the top 15 most elegant tautologies (according to the program) and their simplest proofs in this particular axiom system:

```
Unit : True
\a. DoubleNegElim (\b. a) : forall a. False -> a
Right Unit : forall a. a || True
Left Unit : forall a. True || a
\a. Unit : forall a. a -> True
\a. a : forall a. a -> a
\a. \b. DoubleNegElim (\c. b) : forall a b. a -> (False -> b)
Right (\a. Unit) : forall a b. b || (a -> True)
Left (\a. Unit) : forall a b. (a -> True) || b
\a. Right Unit : forall a b. a -> (b || True)
\a. Left Unit : forall a b. a -> (True || b)
Right (\a. a) : forall a b. b || (a -> a)
\a. a Unit : forall a. (True -> a) -> a
Left (\a. a) : forall a b. (a -> a) || b
Pair Unit : forall a. a -> (True && a)
```

The proof of the principle of explosion on the second line is particularly nice, but the program’s ranking of what constitutes a “nice” theorem starts getting weird after about line 6. The ranking rule of *shortest theorem* is very simplistic, not taking into account things like how each theorem has an impact on making others easier to prove. In fact, the program never adds theorems to its “library”, which means that finding a proof of a sentence that depends on several other theorems with complex proofs is very hard, because it has to re-prove them every time.

Another interesting thing to try with this code is to search for the least complex proofs for a given theorem. This is easy enough to do:

```haskell
listProofs :: Monotype -> IO ()
listProofs mt = do
    let targetTheorem = universalClosure defaultEnv mt
    forM_ allExprs $ \e ->
      when (isProofOf targetTheorem e) $ print e

isProofOf :: Polytype -> Expr -> Bool
isProofOf pt e = case principalType defaultEnv e of
    Left _ -> False
    Right t -> isMoreGeneralThan defaultEnv t pt
```

Running `listProofs` on an easy theorem rapidly generates an infinite list of proofs of increasingly-unnecessary complexity. For example, running it on \(\bot \rightarrow \bot\) produces a list that starts with:

```
\a. a
DoubleNegElim DoubleNegElim
\a. (\b. b) a
(\a. a) (\a. a)
\a. (\b. a) a
\a. (\b. a) Case
\a. DoubleNegElim (\b. a)
\a. (\b. a) DoubleNegElim
\a. (\b. a) Fst
\a. (\b. a) Left
(\a. a) (DoubleNegElim DoubleNegElim)
...
```

On the other hand, I still haven’t gotten a search for the law of the excluded middle \(p \vee (p \rightarrow \bot)\) or Peirce’s law \(((p \rightarrow q) \rightarrow p) \rightarrow p\) to succeed from this axiom set. Since the axioms are complete for classical logic, such a search must terminate eventually, but the shortest proof may be way down the list.

**Get the source code and play with it yourself!**

But, of course, you’re not here for me, you’re here for you. So let’s talk about **Google Guava**. For the unfamiliar, Guava is a collection of utilities for performing common tasks in Java, similar in spirit to Apache Commons or C++’s Boost. I was introduced to Guava by a fellow CERTON developer for the purpose of performing some set operations more easily in the selection logic in CertSAFE Modeler. Since then, the use of Guava in our codebase has expanded greatly, making it far and away the library we get the most use out of.

Even if you’re already using Guava, there are most likely parts of it that you’re not using, simply because you don’t know they exist. I know I find myself saying “Wow! I didn’t know Guava had that!” on a regular basis. Here’s a quick list of the most common uses of Guava in the code I write. Undoubtedly there are things I haven’t listed here because I haven’t discovered them yet, and you might just find something in this list that you wish you had known about a year ago.

- **Immutable collections.** I’m a functional programming wonk, which means that, by default, I think in terms of immutable values and pure functions rather than updating state in place. It’s not like I never write imperative code; certainly, when clarity or efficiency calls for it you can’t beat a good `ArrayList` or `HashMap`. But I think I get more use out of Guava’s `ImmutableSet`, `ImmutableMap`, and `ImmutableList` than I do out of all of the `java.util` collection classes. Special mention goes to the `ImmutableMap.Builder` class, which has the neat property of *rejecting duplicate keys* rather than collapsing them like repeated `put()` calls on a `Map` do. Thus, using `ImmutableMap.Builder` serves a triple purpose: it documents that your insertion strategy will not add duplicate keys (usually because you’re looping over another collection and adding keys one-to-one); it documents that the resulting collection is immutable and cannot be changed; and it catches bugs by detecting duplicate keys or attempts to mutate the collection when you didn’t expect them.
- **Oddball collections.** This includes `BiMap`, `Multiset`, `ListMultimap`, `SetMultimap`, `Table`, and more. After using these, you’ll be asking yourself why they weren’t in the Java Collections API to begin with.
- **Miscellaneous collection utilities.** This includes `Sets.union()`, `Sets.intersection()`, `Sets.difference()`, `Iterables.filter()`, `Iterables.transform()`, and probably my single favorite method in Guava: `Iterables.getOnlyElement()`, which takes an `Iterable` that must have exactly one element and returns that element, throwing an unchecked exception if it contains zero or more than one element.
- **`Optional`.** Despite the documentation disavowing any relationship between `Optional` and “any existing ‘option’ or ‘maybe’ construct from other programming environments”, `Optional` looks like, walks like, and quacks like an option type. Since I discovered `Optional` early in the development of CertSAFE, our team has adopted a soft “no `null`s” convention: only use `null` if there’s a good reason not to use `Optional`, like strict performance requirements. (And *definitely* don’t use -1 or a similar in-band signal.) Taking it a step further, I often write defensive fail-fast `null` checks on the parameters of public constructors and methods. (Since our codebase is Java 7, I use `Objects.requireNonNull()`, but pre-7 users can use Guava’s `Preconditions.checkNotNull()`.) The result has been a very visible decrease in hard-to-reproduce `NullPointerException`s and a great deal more confidence in the quality of the code I write, including in areas that are hard to unit test.
- **`Charsets`.** This is a little class that will save you a lot of silly redundant `try-catch` blocks, by giving you constants representing the six `Charset` objects that every JVM is guaranteed to support.

There are other Guava utilities that I use regularly as well, like `Ordering`, `Joiner`, the caches, etc., but the classes listed above are the ones I felt the most relief upon discovering.