BookWorm 0.1 now available to download

Posted by tnicoll on March 11, 2012 in Development

Get it while it’s lukewarm.  The first release of BookWorm is now available.  Nothing substantial to add since the last post, other than it’s now running Apache Tika 1.0.  This seems to have fixed an issue I was having opening some PDFs but possibly introduced a new issue with the dialogue count not working correctly in some cases.   So that’s one known bug.  Can you find any others?  Probably.  Give me a shout if you do though.  Don’t just keep it to yourself like some greedy bug troll or something.  Anyway, if you’re a writer or interested in text analysis I hope this application is of some use to you.

T.

Sweet Ass Layouts

Posted by tnicoll on March 5, 2012 in Development

Just a quick post to point out that I’ve finally fixed the godawful layout of the statistic text fields in BookWorm.  Not had a lot of time for writing code recently but had a couple of hours tonight so finally got round to fixing the horrible-looking display.  I used SpringLayout, in particular the SpringUtilities class, to create a simple but infinitely better-looking stat grid.

I’ve never been overly enthusiastic about GUI design, preferring the nuts-and-bolts code behind the scenes, but it’s hard not to acknowledge the importance of it.  It’s got to be done because it doesn’t matter how well your program does what it’s supposed to, if it looks like crap no one will use it.  That’s just the way it is.  Thankfully, Java comes with a lot of tools to help programmers like me put together something that looks okay fairly easily.  Thanks Java!

I think the next step is to refactor some code, get some unit tests done and then see where the project is at.

Syllables and Word Types

Posted by tnicoll on February 12, 2012 in Development

I’ve just implemented syllable and word type counting in Bookworm.  Basically, words are now checked against a dictionary not just for their spelling but for the number of syllables they contain and the type of word they are (noun, verb, adjective or adverb).  This involved building a custom dictionary file from an assortment of different word lists supplied by Ashley Bovan’s website.  The end result is a fairly substantial dictionary.txt file of the format <word>|<syllables>|<type>.  Unfortunately, the information in these lists is not complete, so results can be fairly patchy, but it’s a start and might prove useful to some people.  At some stage I’d like to implement a way to easily update this information through the application itself, although it’s not a difficult process to update the dictionary file manually, merely a time-consuming one.
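
To make the file format concrete, here is a rough sketch of how one line of dictionary.txt might be parsed.  It is not BookWorm’s actual code: the DictionaryEntry class, its field names, the use of -1 for a missing syllable count and the assumption that the type is stored as a plain word are all made up for illustration.

```java
import java.util.Locale;

// Hypothetical holder for one dictionary.txt line; names are illustrative only.
final class DictionaryEntry {
    final String word;
    final int syllables;   // -1 when the source word list gave no syllable count
    final String type;     // e.g. "noun"; empty when the type is unknown

    DictionaryEntry(String word, int syllables, String type) {
        this.word = word;
        this.syllables = syllables;
        this.type = type;
    }

    // Parses a line of the form <word>|<syllables>|<type>.
    // Missing or empty trailing fields are tolerated because the data is patchy.
    // A non-numeric syllable field would still throw NumberFormatException.
    static DictionaryEntry parse(String line) {
        // A limit of -1 keeps trailing empty fields such as "zebra||".
        String[] parts = line.split("\\|", -1);
        String word = parts[0].trim().toLowerCase(Locale.ROOT);

        int syllables = -1;
        if (parts.length > 1 && !parts[1].trim().isEmpty()) {
            syllables = Integer.parseInt(parts[1].trim());
        }

        String type = parts.length > 2 ? parts[2].trim() : "";
        return new DictionaryEntry(word, syllables, type);
    }
}
```

Treating the missing fields as "unknown" rather than as errors is just one way of coping with the incomplete word lists.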

I’ve had a cursory look at determining syllables programmatically.  It doesn’t appear to be an easy problem to solve.  I’d be concerned about implementing any approach that would carry considerable processing time, but this could be minimized by processing the words in the existing dictionary text file separately from the main application, and only processing words not contained within dictionary.txt in Bookworm itself.  Of course this would depend on finding a reliable algorithm to begin with.

As far as word types go, I may have to look at whether there are any services that will allow me to look up words and determine their type, in large volume.  Because manually going through the list with an Oxford English Dictionary is *not* happening no matter how useful the end result.

Unless I get really, really bored.

Obviously.

Counting Dialogue

Posted by tnicoll on February 7, 2012 in Development

I’ve just completed the last of the initial set of ‘counts’ – Dialogue.  What I originally thought might be quite difficult has actually turned out to be not that difficult at all.  Which is slightly worrying in itself.

I solved the problem using a simple regular expression.  It had to be simple, since that’s the level of my understanding of regular expressions.  The problem as I saw it was that I needed to count the number of occurrences of areas where text lay in between two quotation marks.  An obvious problem was that speech could be found between double and single quotes, depending on the preference of the writer, and I believe there may be a US/UK difference of opinion on which is the correct one to use.  A consequence of catering for single quotes was of course that apostrophes are used for other purposes and I would have to find a way to avoid including them in the counts.

After thinking about it, I was able to narrow the problem down.  I didn’t have to count all occurrences of single or double quotes.  I only had to count those that were immediately followed by a space, tab or end character of some kind.  This is how speech is supposed to be written in text:

“This is how,” said tnicoll, “speech should be written.”

With that in mind, the regex seemed to be straightforward:

['’\"]\\s

All this does is check for text that matches one of the characters ' ’ " followed by some form of whitespace.  (The pattern is shown as it appears inside a Java string literal, hence the extra backslashes.)  Using the Pattern and Matcher classes it is then a simple case of calling Matcher’s find method inside a while loop to count the number of occurrences.
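
For anyone who wants to try it, the counting loop looks roughly like this.  Treat it as a sketch rather than the exact BookWorm source: the DialogueCounter class and the countClosingQuotes method are names invented for the example, and the curly apostrophe is written as a Unicode escape so the file compiles whatever its encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class DialogueCounter {
    // The regex from the post: a straight single quote, a curly right single
    // quote (U+2019) or a straight double quote, followed by one whitespace
    // character.
    private static final Pattern CLOSING_QUOTE = Pattern.compile("['\u2019\"]\\s");

    // Counts closing quotation marks, so each match stands for one stretch of speech.
    static int countClosingQuotes(String text) {
        Matcher matcher = CLOSING_QUOTE.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }
}
```

A couple of things fall out of the pattern as written.  A closing quote at the very end of the text has no whitespace after it, so it is not counted.  A plural possessive such as "the dogs' bowl" is counted even though it is not speech.  The character class also leaves out the curly closing double quote (U+201D), so text using typographic double quotes would need that character added.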

This seems to do the trick perfectly.  I may be missing something though so if you do have any suggestions/corrections I’d be grateful to read them.

Elsewhere, besides the stuff I’ve already mentioned I have a few more ideas for statistics to add.  These include documenting the number of nouns, verbs, adjectives and adverbs and sorting/counting words by syllables.  I’d also like to play around with other things that can be produced from the data, such as graphs and even creating some sort of fingerprint image from the text.  As with anything in this program, the results may be fascinating or mind-numbingly boring.   I suspect there is no middle ground here.

Spellchecking

Posted by tnicoll on February 4, 2012 in Development

I’ve just completed Bookworm’s spell checking system.  It’s a simple approach.  Every word detected is checked against a hashlist of dictionary words for its existence.  If it isn’t found it gets marked as an unrecognised word and a counter is incremented for displaying the total number of suspected incorrect words.  This involved switching back to using a Word object that is basically a wrapper for a string with a boolean value to keep track of whether the word has been recognised or not.  I had initially used a Word object with this in mind but completely forgot about it when I switched to using the Guava collections classes for storing words.  I suppose this is why it’s a good idea to write this stuff down.
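
In code, the idea is something like the sketch below.  It is a simplified stand-in rather than the real classes: the SpellChecker name, its methods and the choice to lower-case everything before the lookup are assumptions made for the example, and a plain java.util.HashSet plays the part of the hashlist.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Wrapper for a string plus a flag recording whether the dictionary knew it.
final class Word {
    private final String text;
    private final boolean recognised;

    Word(String text, boolean recognised) {
        this.text = text;
        this.recognised = recognised;
    }

    String getText() { return text; }
    boolean isRecognised() { return recognised; }
}

// Hypothetical checker; class and method names are illustrative only.
final class SpellChecker {
    private final Set<String> dictionary = new HashSet<>();

    SpellChecker(Iterable<String> dictionaryWords) {
        for (String entry : dictionaryWords) {
            // Assumption: lookups ignore case.
            dictionary.add(entry.toLowerCase(Locale.ROOT));
        }
    }

    // Wraps every word and marks it as recognised or not.
    List<Word> checkAll(Iterable<String> words) {
        List<Word> checked = new ArrayList<>();
        for (String w : words) {
            boolean known = dictionary.contains(w.toLowerCase(Locale.ROOT));
            checked.add(new Word(w, known));
        }
        return checked;
    }

    // The counter shown to the user: how many words were not found.
    static int countUnrecognised(List<Word> words) {
        int count = 0;
        for (Word w : words) {
            if (!w.isRecognised()) {
                count++;
            }
        }
        return count;
    }
}
```

Keeping the flag on the Word itself means the unrecognised words can later be listed or highlighted without checking the dictionary a second time.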

At the moment the spellchecker does not do anything sophisticated like recommend close matches.  The application is primarily going to be an analysis tool, not a spell checker, so I’m unsure if I will bother with this.  I have implemented one of these before at work although it was not particularly fast or memory efficient.  I believe Apache Lucene offers a solution to this so I may investigate at a later stage but this probably won’t be a priority.

I am pretty convinced that I will offer the ability to set a custom dictionary, which would be useful for internationalisation purposes.  This will involve adding a preferences menu, which I was planning on adding at some stage but will now probably add sooner rather than later.

The next big piece of functionality will involve counting the number of quotes in a document.  I will talk more about this in another post.

To Do List

Posted by tnicoll on January 21, 2012 in Development

There are a few things that I would like to implement in bookworm.  In no particular order:

  1. Text highlighting – clicking on a word in the list highlights all occurrences of the word in the text pane and scrolls to the first instance of that word.
  2. More statistics – Paragraph count, word count, average word count per paragraph, sentence count, speech count, file size, character count, spelling error count.
  3. Cleaner statistic display.
  4. Spellcheck.
  5. List of phrases.
  6. Word list that counts words like is not and isn’t as the same thing.
  7. Loading from web pages.
  8. Internationalization.
  9. Help facility.
  10. Clean up/document/unit test code.

Of those, number 2 looks like the one I will be focusing on in the short term, since it’s really the area that is visibly lacking in the application right now.  Once this is in place the application should be in a reasonably useful state.

Update

Posted by tnicoll on January 18, 2012 in Development

Development on Bookworm has progressed at roughly the same speed as my posting on this blog.  A baby will slow things down.  As will trying to level up in Battlefield 3 and Skyrim.  However, I am keen to get back into things.  I committed some changes to git last night.  Specifically, I changed the application to use a Multiset from Google’s Guava collections library.  This is a great library that continues the excellent work started by Joshua Bloch in the Java Collections Framework.  This seemed to solve the issue of counting the frequency of strings in a much more elegant and useful manner, removing the need for me to maintain my own data structure.  Reinventing the wheel and all.
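
For comparison, this is roughly the sort of hand-rolled counter that a multiset makes unnecessary.  It is a plain java.util sketch, not BookWorm’s old code and not Guava itself; the WordFrequency class is made up for the example.  Guava’s Multiset gives you the same add and count operations out of the box.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical hand-maintained frequency counter, shown only to illustrate
// what a multiset does for you. Names are illustrative.
final class WordFrequency {
    private final Map<String, Integer> counts = new HashMap<>();

    // Records one more occurrence of the word.
    void add(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    // Returns how often the word has been added, or 0 if never.
    int count(String word) {
        return counts.getOrDefault(word, 0);
    }
}
```

With Guava on the classpath the whole class collapses to a HashMultiset of strings, which is the point of not reinventing the wheel.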

I’ve also realised how much I’ve come to love Maven.  For adding libraries to a project it’s just so straightforward.  You just add a dependency to an XML file and then never have to think about it again.

I hope to post more soon about some thoughts I’ve been having about Bookworm.  But don’t wait up for me or anything.

While we’re on the subject

Posted by tnicoll on October 17, 2011 in Development

Also: check out Project Euler if you’re really interested in a challenge.  By challenge, I mean maths.  Lots of maths.  I’ve only completed 13 of the challenges (they’re hard alright!).  Worth doing if you’re into algorithms and optimisation, or again just something to pass the time.  The early problems aren’t too bad but it does get tough fairly quickly.  Good luck!

A long way to go...

CodingBat

Posted by tnicoll on October 16, 2011 in Development

Haven’t had much of a chance to update this recently.  Turns out having a kid tends to take up quite a bit of time.  Who knew?  They should tell people that beforehand.

Just wanted to mention one of my favourite programming sites: CodingBat

This site is awesome if, like me, you don’t have as much spare time on your hands as you’d necessarily like, but still want to write some code.  It’s basically just a huge collection of small programming puzzles for you to solve in either Java or Python.  You write and compile the code in the browser so there’s no need to even fire up an IDE.  Great for warming up, winding down or just passing the time without making you feel like you’re wasting it.

BookWorm or how I got off the can and started writing code

Posted by tnicoll on September 19, 2011 in Development

For someone who has been writing code professionally for almost five years now, and wrote a lot of code before that at university, and spent too much of his childhood typing program listings from magazines into a ZX Spectrum, it hit me recently that I don’t have an awful lot of public code to point to in order to show someone what I can (or can’t) do.  In fact, I have none.  I have real life production code out there, working away while you read this.  Code I’m proud of and code I’m, well, less proud of.  Unfortunately, code that my employer owns and code that they certainly don’t want me showing off to anyone else.

It’s a common enough issue.  Without knowing real life statistics on it, I’d feel fairly confident guessing that the majority of employers don’t open source their code.  Whether they should or not is not really the issue.  The fact is they don’t.  And even if you disagree with it in principle, you’d be pretty stubborn I think if you couldn’t see the argument for not releasing your code.  What this means for developers is that they will never be able to show off the code they produce day to day.  You know… to future employers.  Or their kids.  But mostly future employers.

This was one of the reasons I wanted to start an open source project.  I’m happy to say that I’m fairly happy in my current job so I wasn’t driven entirely by the desire to have a body of work to back me up in an interview.  Although it’s certainly not something that would hurt to have.  One of the main reasons was the desire to start playing with things they don’t let us play with at work.  There is so much cool stuff out there right now like Apache Tika and it’ll be some time before we get to use Java 7 in work.  Put it this way: I fully expect 8 to be released before then.

There are reasons I haven’t done this before of course.  The main one being laziness.  The other was finding the right project.  If you’re thinking this one is an excuse, you’re probably right.  There are tons of interesting open source projects out there that are looking for help.  However, I will concede that it can often be daunting approaching an established project that you find interesting and trying to fathom how you can possibly help.  Especially as a beginner, where merely trying to understand the code can be enough of a challenge to put you off, never mind actually contributing anything.

Recently though, I had an idea for a piece of software.  I write a lot in my spare time.  Well, I wasn’t writing code, I might as well do something eh?  I was thinking how cool (geeky) it would be to be able to look through a document and see how many times you’ve used particular words.  Just to see how often you repeat yourself.  And it got me thinking how useful it might be to know other stats about a piece of writing.  As it turns out, someone has already written software to do just that.  It’s called Textanz and it looks rather good.  But the more I thought about it, the more I realised I could write this software myself.  While the problem space is unlikely to be of interest to most people, there were some interesting technical challenges to solve.

So I registered myself on GitHub, started a repository and began coding.  It’s called BookWorm and you can access it here if you like.  It’s starting to take shape and I hope to discuss its progress in this blog.  There’s still loads to do.  But that’s fine, it’s not like there’s a deadline involved here.

If you’re like me and you’ve been thinking about starting/joining a project then you should.  Right now.  Go on.  Go.  Piss off already.

Copyright © 2011-2012 tnicoll.co.uk All rights reserved.