Sunday, July 22, 2007
SF Status pages: bug fix
Just wanted to let you guys know about a bug fix on the status pages for the SF data. Each project has a "status" (e.g. alpha, beta, production). We had assumed that each project had only a single status, representing the project's "current" status, but this turned out not to be the case.
Our code was therefore erroneously grabbing only the first of a possible list of status codes. Consequently, some projects that had multiple status codes were not shown correctly.
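To make the bug concrete, here is a minimal Python sketch of the difference between keeping only the first status and keeping all of them. The row layout, the sample values, and the function names are assumptions made up for illustration; this is not the actual collection code.

```python
from collections import defaultdict

# Hypothetical (project_id, status) rows, invented for illustration only.
rows = [
    (1, "beta"),
    (1, "production"),
    (2, "alpha"),
]


def first_status_only(rows):
    """Old behaviour: keep only the first status seen for each project."""
    result = {}
    for project_id, status in rows:
        # setdefault ignores every status after the first one for a project.
        result.setdefault(project_id, status)
    return result


def all_statuses(rows):
    """Corrected behaviour: keep every status listed for each project."""
    result = defaultdict(list)
    for project_id, status in rows:
        result[project_id].append(status)
    return dict(result)
```

With the sample rows, the first function reports project 1 as "beta" only, while the second reports both "beta" and "production".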
I have gone into the June 2007 and April 2007 data and made the corrections, and I'll get to the older data sets at some point. (Let me know if this is a high priority for you and I'll try to get to them faster.) Obviously, the upcoming August data set will now be run correctly.
When looking through the old packages, you'll want to look for the files marked ".fixed.bz2".
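If you have unpacked an old package locally, something like the following sketch could be used to pick out the corrected files and read them. It assumes the files are plain bzip2-compressed text sitting somewhere under one directory; the function names and the directory argument are hypothetical.

```python
import bz2
from pathlib import Path


def find_fixed_files(package_dir):
    """Return every file under package_dir whose name ends in .fixed.bz2."""
    return sorted(Path(package_dir).rglob("*.fixed.bz2"))


def read_fixed_file(path):
    """Decompress one corrected file and return its contents as text."""
    # errors="replace" is a cautious choice, since the encoding is assumed.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return handle.read()
```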
Wednesday, July 04, 2007
Debian data released
I collected some Debian package data and started parsing it to see what kind of stuff we might find in there.
I will probably need some help from the user community on this one, to know what sort of data you find interesting in these packages.
Here are the files I collected:
- project home pages for stable, unstable, and testing versions (Example of stable page)
- copyright pages (Example of a copyright page)
- developer information page (Example of a developer information page)
- changelog page (Example of a changelog)
- bug reports page (Example of a bug reports page)
Obviously there is a lot of information there, and I only parsed some of it out for this initial run. Here are the items I parsed and released:
- package name, version, parent directory
- any URLs found in the copyright page, and any URLs found within the textual description of the project found on the stable project page
- developers (maintainers and co-maintainers listed on the developer information page)
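As a rough idea of what pulling URLs out of a copyright page or a project description can look like, here is a small Python sketch. The regular expression and the function name are assumptions for illustration, not the parser actually used for this release, and a simple pattern like this will miss or mangle some unusual URLs.

```python
import re

# Deliberately simple pattern: a scheme, then everything up to whitespace,
# an angle bracket, or a closing parenthesis.
URL_PATTERN = re.compile(r"(?:https?|ftp)://[^\s<>)]+")


def extract_urls(text):
    """Return the distinct URLs found in text, in order of first appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.rstrip(".,;:")
        if url not in urls:
            urls.append(url)
    return urls
```

Duplicates are dropped so that a URL repeated several times in the same page is only recorded once.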
I'd love to hear from the community about what items you would like to see parsed out.