Saturday, 5 January 2013

The next meeting, as decided at the last session, was held today, 27 November 2012. All the members of the team arrived well in time, and we started with the day's discussion.

As the next step to be performed was stemming, David and Anupreet came with the required research on the stemming portion, and both provided code. The code was then implemented in the presence of all five members, after editing it and listening to everyone's views. Stemming ultimately showed good results and we ended up with one proper working coding scheme for it. Everyone was happy with this portion of the system.
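The post doesn't include the stemming code itself, but a minimal sketch of the kind of NLTK-based stemming described here might look like the following (the function name and use of the Porter stemmer are assumptions, since the actual code isn't shown):

```python
# Sketch of a stemming step using NLTK's Porter stemmer.
# The function name and structure are illustrative, not the team's actual code.
from nltk.stem import PorterStemmer


def stem_words(words):
    """Reduce each token to its stem, e.g. 'running' -> 'run'."""
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in words]


print(stem_words(["running", "jumps", "connected"]))
```

The Porter stemmer needs no extra corpus downloads, which makes it a convenient default; NLTK also offers the Snowball stemmer if more aggressive or language-specific stemming were wanted.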

After stemming, at the end of the meeting it was decided that the next session would focus on the generation of feature vectors. The members allocated to the feature vector generation task agreed to research it before then. With that, the meeting came to an end and we left.

Members Present: Alexandru Palade, David Wijaya, Anupreet Kaur, Ste Brown, Andrew Hill.



Thursday, 3 January 2013

During the meeting held on 23rd November 2012 we firstly reviewed the comments made by Jeremy Ellman on the progress we had made so far. We were told that we would be given a lecture in the coming week to advise us on breaking down the individual tasks.

Prior to this we had successfully inserted the removal of stop words into the program. We presented this to the module tutor, who was satisfied with the output, so we decided that the next step would be to research how the text could be cleaned even further.

After an extensive amount of research we discovered that there are many different pre-programmed functions within Python that allow the user to clean the text to a higher standard. We decided that the next step to move the project forward would be to divide the project into sections and allocate each member tasks according to their expertise. The first task we decided on was cleaning the text; this had already been researched and just required implementing, so Ste was allocated it. The next task was stemming the text, which David and Anupreet were given. Feature vectors were the next task, also given to Ste. Text similarity was decided as the next task, assigned to Andy, Alex and Ste. The final section was clustering, which Andy, Alex and Anupreet took the liberty of implementing.

Attendees: Stephen Brown, Andrew Hill, David Wijaya, Anupreet Kaur, Alexandru Palade


Wednesday, 28 November 2012

Wednesday 21st November - XP cont.

This was the first occasion that all members of the team met together, which was good. This session was much like the one before.

We first discussed what the next developmental step should be. We mutually decided that the removal of stop words should be our next step in order to reduce the size of the data set.

After several failed attempts we managed to incorporate this function into the loop we created last week. The loop takes the text files and removes the stop words using the NLTK, leaving a list of our non-stop words. This list is then written back to file.

It was mentioned in this meeting that it would be more productive to split the project into different sections, with each member of the team working on their respective section. We mutually agreed that this was a good idea and will look to Jeremy Ellman for advice on how best to do this.

Attendees: Stephen Brown, Andrew Hill, David Wijaya, Anupreet Kaur, Alexandru Palade

Wednesday 14th November - Extreme Programming

Today we had a second meeting, which was more of an extreme programming session (in the sense that there were 4 of us working on one module).

Firstly we generated a small amount of abstract pseudo code to outline the logical steps we needed to take to make a start on the solution. After this we fired up Python IDLE and started on our first step, which was reading in all of the HTML files.

We managed to create a loop to go through every folder and sub-folder in the test data and look for HTML files. Where the loop found HTML files, Beautiful Soup would extract the text, which would then be saved as a plain text file in its respective directory.


We were happy with the results of our first programming session.

Attendees: Stephen Brown, Andrew Hill, David Wijaya, Anupreet Kaur

Wednesday 7th November - Meeting

Today we had our first group meeting. In the meeting we sat down, read through the assignment documentation and discussed how we should start working toward a solution. We came to a mutual decision that we should go home and do some research in order to gain some knowledge of the task at hand. That was where we left the meeting.


Attendees: Stephen Brown, Andrew Hill, David Wijaya, Anupreet Kaur

Introductory Post

As part of our assessment for the Artificial Intelligence module, we have been asked to work in a group of up to 5 to create a solution to the web person search task as defined below:

"This task focuses on the disambiguation of person names in a Web searching scenario. Finding people, information about people, in the World Wide Web is one of the most common activities of Internet users. Person names, however, are highly ambiguous. In most cases, therefore, the results for this type of search are a mixture of pages about different people that share the same name.
 
The participant's systems will receive as input, web pages retrieved from a web search engine using a given person name as a query. The aim of the task is to determine how many referents (different people) exist for that person name, and assign to each referent its corresponding documents. The challenge is to correctly estimate the number of referents and group documents referring to the same individual." (Artiles et al. 2007 Appendix 1. http://nlp.uned.es/weps/weps-1/weps1-data)

The group members are as follows:
  • Stephen Brown
  • Andrew Hill
  • David Wijaya
  • Anupreet Kaur
  • Alexandru Palade
In this blog we will be logging our progression on a weekly basis. This will include any meetings that took place and matters discussed at said meetings.


That's all for now.