[oats-sig] RE: Word/Sentence Prediction Tool for Code-A-Thon
Simon.Judge at nhs.net
Hi, this conversation has been happening off-list about the code-a-thon.
Read down to find out what has been going on (a bit messy, sorry) and
feel free to comment. Apologies for this: we tried to post to the list,
but the messages bounced.

S

----

I'm glad this is prompting interest for the code-a-thon, great stuff.

From my perspective, I'm not too fussed about the method of
implementation (e.g. MySQL); however, there are a few constraints:
a) small package size, b) single install file, c) lookup time
significantly <1s, ideally <100ms (i.e. barely noticeable)...

I think the idea of having character, word and sentence prediction is
one to pursue, and we have identified a number of possible engines we
might build on (e.g. Dasher for character, Prophet (if released) and
some of the others Chris has found).

I find the idea of using WordNet VERY interesting, especially given the
work on concept coding... That said, I think this should build on a
simple, small, 'light' engine that offers more standard functionality...

I am waiting to hear back from my colleague about releasing
disambiguation code for another prediction project...

I really hope Sheri decides to open the Prophet code (get pursuing,
Mats). Releasing it for the code-a-thon would be a great motivation, one
would have thought.

It also occurs to me that we should summarise all the useful info below
on a page, either on the OATS or the Project Possibility site, whatever
happens with this code-a-thon. Volunteers?

I have a couple of papers on the topic here:
http://www.citeulike.org/user/simonjudge/tag/word-prediction
(with a few more to tag)... if anyone else finds some, you could always
post them to the CiteULike Assistive Technology group...

Cheers.
Simon

-----Original Message-----
From: Christopher Leung [mailto:christopher.leung at projectpossibility.org]
Sent: Sunday, November 11, 2007 7:01 AM
To: mats.lundalv at vgregion.se
Cc: Simon Judge; marc.allen at projectpossibility.org; Andrew Lysley;
    druce at ace-centre.org.uk; colven at ace-centre.org.uk;
    eive.landin at sit.se; William Chang
Subject: Re: Ang: Re: Ang: RE: Word/Sentence Prediction Tool for Code-A-Thon

Hi Mats,

Thanks for your comments! I was able to speak to a friend who is a
Natural Language Processing Ph.D. candidate at my university, and not
surprisingly, he came to the same conclusions as you have stated.

Again, I'm having fun with this, but I'd like to run this by you. How
about using, say, a MySQL database to take care of most of our lookup
operations? Below I've written a SQL query that uses a table "3WF"
containing "three word fragments" (three columns, one word each) and a
table "WFREQ" containing a list of words and their frequency of
occurrence (two columns), both populated from some source text. It
returns a ranked list of words that begin with the characters currently
typed and whose two previous words match the current two previous words.

    SELECT DISTINCT T1.WORD_3, T2.FREQUENCY
    FROM 3WF T1, WFREQ T2
    WHERE T1.WORD_1 LIKE "%$w1%"
      AND T1.WORD_2 LIKE "%$w2%"
      AND T1.WORD_3 LIKE "$w3%"
      AND T1.WORD_3 = T2.WORD
    ORDER BY T2.FREQUENCY DESC

    $w1 = word 1
    $w2 = word 2
    $w3 = what's been typed of word 3

The database could also be used to do simple character-level prediction.

Given that there would probably be fewer than 100,000 entries in the 3WF
table (the length of a novel) and fewer than 20,000 entries in the WFREQ
table, perhaps we can achieve lookup performance that feels real-time to
the user (<1 sec after typing a key).

Chris

mats.lundalv at vgregion.se wrote:
> Hi Chris and all,
>
> See my comments below ...
>
> Cheers,
> Mats
>
> -----Christopher Leung <christopher.leung at projectpossibility.org> wrote: -----
>
> Simon, Mats, thanks for your responses.
>
> Unfortunately Dasher seemed to have trouble detecting the language
> correctly on my Windows system and defaulted to Swedish (I think), so
> I'm not able to play with it.
>
> *-> Swedish, of all things - it doesn't usually pop up as a default
> on non-Swedish systems ;-) Hope it will be sorted out!*
>
> I think this site summarizes the problem nicely:
> http://www.asel.udel.edu/natlang/nlp/wpredict.html
> (They seem to have a lot of very relevant projects. We would probably
> have to contact them to get access to any of it; unfortunately it looks
> like it's mostly done in LISP.)
>
> *-> Good old Pat Demasco and friends at the Univ of Delaware - I paid
> him a visit in the mid-90s when we had a cooperation going on, but
> then he suddenly disappeared from the scene. Anyway, probably some
> good reading, though a bit old. You will also find some old stuff
> from Sheri Hunnicutt if you look around (look for "word prediction
> Hunnicutt"). I found a later, very short outline of her group's
> latest work on prediction at:
> http://www.speech.kth.se/prod/publications/files/qpsr/1996/1996_37_2_101-104.pdf
> See some comments about this and Prophet below.*
>
> *There are also several docs from - and references to - work by the
> Dundee people when they were at it ("word prediction Dundee"), and
> if you look for "word prediction fasty" you'll find some stuff from
> a more recent European word prediction project, "FASTY", that never
> seemed to really make it to a product. Hmm, wonder if there could be
> some candidate code for open sourcing in that heap?*
>
> I agree, we can definitely leverage the open-sourced word listing
> algorithms out there, as you mentioned.
>
> As for the prediction, unless there is something out there already, I
> can ping one of my linguistics/natural language processing friends to
> see if he can offer any input on a basic grammar that could 'filter'
> the possibilities of words that could appear next, given a sentence.
>
> Prophet mentions "word pairing"--I wonder if their algorithm simply
> limits the possibilities based on the previous word (for example, a
> verb cannot follow a verb, an adjective cannot follow an adjective
> without a conjunction, etc.). That would be an easy way to add some
> intelligence to the word prediction.
>
> *-> The existing Prophet predictor is purely statistical prediction,
> based on a main lexicon derived from a (hopefully) good, large text
> corpus with frequency information for each word. This is complemented
> by a dynamic, user-generated lexicon based on frequency of use, PLUS
> a "word-pair" lexicon based on the statistical relation between
> commonly used word pairs in the specific language.*
>
> *The beauty of this is that the method can fairly easily be applied
> to different languages with equal success - if you just have good
> lexical material to start from - and you can easily complement the
> lexicons by generating your own from text material (for special
> topic lexicons etc).*
>
> *If I remember right from Sheri's and others' work, their conclusion
> was that to have syntactic processing in a more useful way you will
> generally need to look at least two words back to get enough context
> to draw from. That is what the work I referred to above was all
> about. There was actually such a version of Prophet developed in the
> late 90s, but it was never released because of political intrigues
> that stopped Sheri from using the resources in her product.
> This may be the basic prediction engine code that we could have access
> to soon - we are talking about a meeting in early December with Sheri
> now.*
>
> *The disadvantage of this and other more advanced approaches is
> that any implementation becomes much more language specific, of
> course. In fact, I think Sheri's and others' research showed that
> good statistical word prediction does most (some 70% or so) of the
> job well, and the additional advanced processing needs to work very
> hard to add the few extra percent to the result.*
>
> *So I think we should think in several steps:
> - Start with good basic and relatively language-neutral prediction
>   tools (like most of the proprietary and free stuff out there in
>   Windows)
> - Complement that with high quality basic word list tools - like
>   WordAid - for early level writers/readers with limited vocabularies
>   (the advantage being more stable/non-dynamic presentation etc)
> - And after that start looking at more advanced alternatives with
>   syntactic and semantic processing stuff.*
>
> And if we want to get really clever, we can leverage a freely
> available database like the following, which offers a structure of
> semantic relationships between words:
> http://wordnet.princeton.edu/
>
> Theoretically, we could begin to predict a set of words based on their
> semantic relationships to the context...
>
> *-> Yes, exciting stuff isn't it! ;) We're very involved with
> WordNet in our work to establish restricted, so-called "Concept
> Coded" multimodal vocabularies with graphical symbol support for
> non-reading users of Augmentative and Alternative Communication
> (AAC) - check http://www.conceptcoding.org and
> http://www.symbolnet.org . This could definitely be used for more
> intelligent prediction when it gets a bit more mature in a few years
> - but I think we should probably start to fill in the worst potholes
> first, eh?
> ;)*
>
> Yes, getting a bit ahead of myself here, especially for a Code-A-Thon
> project, but I find this to be a really exciting problem. I'm
> interested to hear your thoughts. :)
>
> Best,
> Chris
>
> mats.lundalv at vgregion.se wrote:
> > Hi Chris and all,
> >
> > I agree, a great and much needed initiative!
> >
> > Concerning Prophet: I was in touch with Sheri Hunnicutt (the owner of
> > the code) last week, and she is seriously interested in letting someone
> > dive into the sources of Prophet to see if it could be transferred to
> > the Linux world - as free software. The basic prediction engine was
> > written in C on a Unix system, and then transferred to Windows. I would
> > suspect that the current UI code is more Windows-specific C++. But
> > anyway, the foundation for a cross-platform version is there.
> > A strong feature of Prophet is that it is a high quality predictor with
> > rule support for several languages (currently Swedish, UK English,
> > Finnish, Norwegian, French, Danish and Dutch - and probably a Russian
> > version somewhere).
> >
> > What is not quite clear in our communication so far is whether Sheri is
> > considering releasing the code completely as OSS, including the Windows
> > version.
> > I would think time is working for that, but the question is whether the
> > timing is reasonable to have something ready for this Open Source
> > Accessibility code-a-thon - I don't think so. It will be something to
> > come back to a bit later when the conditions have been sorted out.
> >
> > Another option is diving into the ACE Centre's (and the Swedish
> > Institute for Special Needs Education's) WordAid 2.0 software, which is
> > just about to be released as OSS. This is a "word list" - rather than
> > "prediction" - writing support tool, but a very good one for earlier
> > level use, and would be a very valuable contribution to the GNU/Linux
> > world. Check it out at ACE's WordAid info.
> >
> > <http://www.ace-centre.org.uk/index.cfm?pageid=BE77E26D-D613-62F1-CFEC170E24F4038A&productid=BE7775C8-D613-62F1-C12B0D2C7A88970B>
> >
> > This is definitely a Windows-specific implementation, but the task could
> > be to investigate, propose and prototype the most efficient and
> > promising path for transferring the functionality of WordAid to a
> > cross-platform existence, either by re-purposing/rewriting the Win C++
> > code as a cross-platform C++ version, or by porting it to e.g. Java
> > - proposing tools and methods for this, etc.
> >
> > What do you think guys?
> >
> > Otherwise: Haven't had time to look into all your links, but the
> > LetMeType pack looks a bit interesting - seems to need some updating
> > plus better internationalisation support (Unicode etc) - and a
> > cross-platform remake, but otherwise seems to be appreciated.
> >
> > Mats <mailto:mats.lundalv at sit.se>
> >
> > -----"Simon Judge" <simon.judge at nhs.net> wrote: -----
> >
> > That sounds fantastic.
> >
> > It would be interesting to 'repackage' Dasher's prediction engine
> > (which is character-level prediction) for a more 'standard' interface,
> > i.e. a prediction list of words... and see what happened...
> >
> > RE ClickNType - as you say, this is closed source, though I think I
> > mailed the person a while back and he said he might open it (so you
> > could try again)... Also, the disadvantage is that the prediction is
> > tied to the on-screen keyboard (i.e. you can't use it with normal
> > typing).
> >
> > LetMeType might have useful code, but isn't really a 'standard'
> > prediction method since it doesn't have a language model (I don't
> > think).
> >
> > Never seen the NLI one...
> >
> > RE T9 - this is of interest to me and I am researching it.
> > Check out:
> >
> > http://www.assistech.org.uk/doku.php/research:disambiguation
> >
> > Tapir is of interest in this list and open source, but again, it doesn't
> > meet the needs of people wanting to enter text through the keyboard. A
> > colleague is developing an (eventually open source?) version of this
> > keyboard that works with a number pad keyboard. I might be able to
> > persuade him to release the source to you if you want to have this as a
> > separate project... however, he was worried about litigation and was
> > looking into signing the code over to the FSF for this reason...
> >
> > Cheers.
> >
> > Simon
> >
> > -----Original Message-----
> > From: Christopher Leung
> > [mailto:christopher.leung at projectpossibility.org]
> > Sent: Wednesday, November 07, 2007 11:00 AM
> > To: simon.judge at nhs.net; mats.lundalv at vgregion.se
> > Cc: marc.allen at projectpossibility.org
> > Subject: Word/Sentence Prediction Tool for Code-A-Thon
> >
> > Simon, Mats,
> >
> > We're very interested in the possibility of making some headway on a
> > word prediction tool for the SS12 Code-A-Thon. You two happened to
> > mention this about a month ago on the oats-sig email list.
> >
> > Though this is probably something that should come out of a research
> > project, it would still be a fun project for us to work on over the
> > weekend, especially if we can leverage existing work/research.
> >
> > You mentioned:
> >
> > Prophet (Commercial)
> > http://www.ace-centre.org.uk/index.cfm?pageid=E79ED3AB-D613-62F1-CD947B4D3539E836
> >
> > Dasher (Open source)
> > http://www.inference.phy.cam.ac.uk/dasher/
> >
> > ClickNType (Freeware, not open sourced)
> > http://www.lakefolks.org/cnt/
> >
> > Prophet and ClickNType could be good programs to test and compare
> > against, but obviously without them being open source, not extremely
> > helpful.
> > On that note, "T9" technology is another popular commercial reference
> > for this problem: http://www.nuance.com/t9/textinput/
> >
> > I did play with Dasher (interesting interface). Maybe we can reuse
> > some of the word prediction code out of there. Also, I've seen some
> > basic word prediction in software like OpenOffice.
> >
> > I've just Googled for open source word prediction and found the
> > following:
> > 1. LetMeType - http://www.clasohm.com/lmt/en/
> > 2. Word Prediction Source Code -
> >    http://www.asel.udel.edu/nli/pubs/1991/VanDyke91d.ps
> >
> > Nothing immediately usable or cross-platform.
> >
> > Lots more to say but I'll stop here--I look forward to hearing your
> > thoughts...
> >
> > Best,
> > Chris
> >
> > **********************************************************************
> > This message may contain confidential and privileged information.
> > If you are not the intended recipient please accept our apologies.
> > Please do not disclose, copy or distribute information in this e-mail
> > or take any action in reliance on its contents: to do so is strictly
> > prohibited and may be unlawful. Please inform us that this message has
> > gone astray before deleting it. Thank you for your co-operation.
> >
> > NHSmail is used daily by over 100,000 staff in the NHS. Over a million
> > messages are sent every day by the system. To find out why more and
> > more NHS personnel are switching to this NHS Connecting for Health
> > system please visit www.connectingforhealth.nhs.uk/nhsmail
> > **********************************************************************

**********************************************************************
This message may contain confidential and privileged information.
If you are not the intended recipient please accept our apologies.
Please do not disclose, copy or distribute information in this e-mail
or take any action in reliance on its contents: to do so is strictly
prohibited and may be unlawful. Please inform us that this message has
gone astray before deleting it. Thank you for your co-operation.

NHSmail is used daily by over 100,000 staff in the NHS. Over a million
messages are sent every day by the system. To find out why more and
more NHS personnel are switching to this NHS Connecting for Health
system please visit www.connectingforhealth.nhs.uk/nhsmail
**********************************************************************
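
For anyone who wants to try the database idea Chris describes earlier in this thread (the "3WF" and "WFREQ" tables), here is a minimal sketch, not a definitive implementation. It makes several assumptions that are not in the thread: Python's built-in sqlite3 module stands in for MySQL, the tables are filled from a tiny sample text, and the helper names build_db and predict are hypothetical. It also matches the two previous words exactly rather than with LIKE "%...%", which is a guess at the intent of "match".

```python
import sqlite3


def build_db(text):
    """Fill the two tables from a source text (sqlite3 is an assumed stand-in for MySQL)."""
    words = text.lower().split()
    db = sqlite3.connect(":memory:")
    # "3WF" starts with a digit, so it is quoted as an identifier here.
    db.execute('CREATE TABLE "3WF" (WORD_1 TEXT, WORD_2 TEXT, WORD_3 TEXT)')
    db.execute("CREATE TABLE WFREQ (WORD TEXT PRIMARY KEY, FREQUENCY INTEGER)")
    # One row per run of three consecutive words in the text.
    db.executemany('INSERT INTO "3WF" VALUES (?, ?, ?)',
                   zip(words, words[1:], words[2:]))
    # One row per distinct word with its number of occurrences.
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    db.executemany("INSERT INTO WFREQ VALUES (?, ?)", counts.items())
    # An index on the lookup columns, intended to keep lookups quick.
    db.execute('CREATE INDEX idx_3wf ON "3WF" (WORD_1, WORD_2, WORD_3)')
    return db


def predict(db, w1, w2, prefix, limit=5):
    """Return words seen after (w1, w2) that start with prefix, most frequent first."""
    rows = db.execute(
        'SELECT DISTINCT T1.WORD_3, T2.FREQUENCY '
        'FROM "3WF" T1 JOIN WFREQ T2 ON T1.WORD_3 = T2.WORD '
        'WHERE T1.WORD_1 = ? AND T1.WORD_2 = ? AND T1.WORD_3 LIKE ? '
        'ORDER BY T2.FREQUENCY DESC, T1.WORD_3 LIMIT ?',
        (w1, w2, prefix + "%", limit))
    return [word for word, _freq in rows]


SAMPLE = ("the cat sat on the mat and the cat sat on the sofa "
          "while the cat slept on the mat")
db = build_db(SAMPLE)
print(predict(db, "on", "the", ""))   # candidates before any letter is typed
print(predict(db, "on", "the", "s"))  # candidates after typing "s"
```

With this sample text, "mat" is listed before "sofa" after "on the" because it occurs more often overall. Note that a typed prefix containing "%" or "_" would be treated as a LIKE wildcard, so a real tool would need to escape those. Whether this approach meets the <100ms target on a novel-sized table would have to be measured.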