Wednesday, 28 November 2012

Wednesday 21st November - XP cont.

This was the first occasion on which all members of the team met together, which was good. The session ran much like the one before.

We first discussed what the next development step should be. We mutually decided that the removal of stop words should come next, in order to reduce the size of the data set.

After several failed attempts we managed to incorporate this function into the loop we created last week. The loop takes the text files and removes the stop words using NLTK, leaving a list of the non-stop words. This list is then written back to file.
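The stop-word step might be sketched something like the snippet below. This is an illustrative reconstruction, not the group's actual code: the function and file names are made up, and it assumes the NLTK stopwords corpus has already been downloaded (via `nltk.download('stopwords')`).

```python
# Minimal sketch of the stop-word removal step. In the session the
# stop-word set came from NLTK, e.g.:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words("english"))

def remove_stop_words(words, stop_words):
    """Keep only the words that are not in the stop-word set."""
    return [w for w in words if w.lower() not in stop_words]

def strip_file(path, stop_words):
    """Read a plain-text file, remove stop words, write the list back."""
    with open(path, encoding="utf-8") as f:
        words = f.read().split()
    kept = remove_stop_words(words, stop_words)
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(kept))
```

Writing the surviving words back over the original file, one per line, matches the "written back to file" step described above.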

It was mentioned in this meeting that it would be more productive to split the project into different sections, with each member of the team working on their respective section. We mutually agreed that this was a good idea and looked to Jeremy Ellman for advice on how best to do this.

Attendees: Stephen Brown, Andrew Hill, David Wijaya, Anupreet Kaur, Alexandru Palade

Wednesday 14th November - Extreme Programming

Today we had a second meeting, which was more of an extreme programming session (in the sense that four of us were working on one module).

Firstly we generated a small amount of abstract pseudocode to outline the logical steps we needed to take to make a start on the solution. After this we fired up the Python IDLE and started on our first step, which was reading in all of the HTML files.

We managed to create a loop that goes through every folder and sub-folder in the test data looking for HTML files. Wherever the loop found an HTML file, Beautiful Soup would extract the text, which would then be saved as a plain-text file in its respective directory.
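That loop might look roughly like the sketch below. Again this is illustrative rather than the group's actual code: the function names and the `.txt`-alongside-the-original output convention are assumptions, and the extraction step requires the `beautifulsoup4` package.

```python
# Sketch of the crawl-and-extract loop. os.walk visits every folder
# and sub-folder under the root; for each HTML file found, Beautiful
# Soup's get_text() pulls out the visible text, which is saved as a
# plain-text file next to the original.
import os

def find_html_files(root):
    """Yield the path of every HTML file under root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(dirpath, name)

def extract_to_text(html_path):
    """Extract the visible text and save it as a .txt file alongside."""
    from bs4 import BeautifulSoup  # pip install beautifulsoup4
    with open(html_path, encoding="utf-8", errors="ignore") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    txt_path = os.path.splitext(html_path)[0] + ".txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(soup.get_text())

def crawl(root):
    """Run the extraction over every HTML file in the test data."""
    for path in find_html_files(root):
        extract_to_text(path)
```

Keeping the output file in the same directory as its source preserves the original folder structure of the test data, which matters later when documents have to be grouped per person name.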


We were happy with the results of our first programming session.

Attendees: Stephen Brown, Andrew Hill, David Wijaya, Anupreet Kaur

Wednesday 7th November - Meeting

Today we had our first group meeting. In the meeting we sat down and read through the assignment documentation and discussed how we should start working toward a solution. We came to a mutual decision that we should go home and do some research in order to gain some knowledge of the task at hand. That was where we left the meeting.


Attendees: Stephen Brown, Andrew Hill, David Wijaya, Anupreet Kaur

Introductory Post

As part of our assessment for the Artificial Intelligence module, we have been asked to work in a group of up to 5 to create a solution to the web person search task as defined below:

"This task focuses on the disambiguation of person names in a Web searching scenario. Finding people, information about people, in the World Wide Web is one of the most common activities of Internet users. Person names, however, are highly ambiguous. In most cases, therefore, the results for this type of search are a mixture of pages about different people that share the same name.
 
The participant's systems will receive as input, web pages retrieved from a web search engine using a given person name as a query. The aim of the task is to determine how many referents (different people) exist for that person name, and assign to each referent its corresponding documents. The challenge is to correctly estimate the number of referents and group documents referring to the same individual." (Artiles et al. 2007 Appendix 1. http://nlp.uned.es/weps/weps-1/weps1-data)

The group members are as follows:
  • Stephen Brown
  • Andrew Hill
  • David Wijaya
  • Anupreet Kaur
  • Alexandru Palade
In this blog we will be logging our progress on a weekly basis, including any meetings that take place and the matters discussed at them.


That's all for now.