Datasets of Dynamic Objects on the Web
Department of Computer Science, UC Irvine
Overview
Many applications need to retrieve information about objects from remote Web sources that are autonomous and non-collaborative. An example of such a source is a service provider (e.g., Monster.com) that is specific to a certain application domain such as jobs. For such applications, it is critical to understand how the objects at the remote sources change over time. This information can help the application choose a crawling schedule that keeps its data at high quality without overusing network resources.
To gain insight into how data objects on the Web change over time, we have collected data objects from six Web sources in four domains: books, cars, forums, and jobs. Our crawlers collected data objects from these sites daily over a period of one and a half years. The following table gives an overview of the collected data (with the Web sites anonymized):
Table 1: Overview of the collected data objects
Domain | Web Source (# of categories) | Crawling Period | # of Objects
Cars | Car Web source 1 (10 models) | 10 months (2006/1 - 2006/11) | 42,543
Cars | Car Web source 2 (10 models) | 13 months (2005/10 - 2006/11) | 46,311
Jobs | Job Web source 1 (7 categories) | 4 months (2006/6 - 2006/10) | 31,157
Jobs | Job Web source 2 (2 categories) | 2 months (2006/12 - 2007/2) | 2,500
Books | Book Web source (3 categories) | 4 months (2005/8 - 2005/12) | 3,315
Forums | Forums Web source (4 categories) | 2 months (2006/12 - 2007/2) | 25,392
We publish the collected data for researchers who want to study the dynamics of Web objects.
Car Web source 1 (Download the dataset)
This is one of the two Web sources we crawled in the car domain. We crawled data for the 10 car models shown in Table 2.
Table 2: Makes and models of the crawled cars
Make | Model
BMW | 3
BMW | 5
Ford | Explorer
Ford | Focus
Ford | F150
Honda | Accord
Honda | Civic
Toyota | Camry
Toyota | Corolla
Dodge | Durango
The schema of the collected data is the following:
Table 3: Schema of collected data
Field | Description
cid | Internal Web source database ID of the car entry
mileage | Mileage of the car
year | The year in which the car was built
crDate | The crawling date
price | The price of the car
make | Make of the car
model | Model of the car
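Because each car is recorded once per crawl, the schema above is enough to estimate how long a listing stayed visible and how often its price changed. The following is a minimal sketch of that idea, not part of the dataset tooling. It assumes the rows have already been parsed into dictionaries keyed by the field names above and that crDate uses the YYYY-MM-DD format; the file format of the download is not described here, and the sample rows and the `summarize` helper are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows, invented for illustration only (not taken from the dataset).
# Field names follow the schema above; crDate is assumed to be YYYY-MM-DD.
rows = [
    {"cid": "a1", "crDate": "2006-01-10", "price": 15000},
    {"cid": "a1", "crDate": "2006-01-11", "price": 15000},
    {"cid": "a1", "crDate": "2006-01-12", "price": 14500},
    {"cid": "b2", "crDate": "2006-01-11", "price": 9000},
]


def summarize(rows):
    """Return {cid: (days_observed, number_of_price_changes)}."""
    by_car = defaultdict(list)
    for r in rows:
        by_car[r["cid"]].append((date.fromisoformat(r["crDate"]), r["price"]))

    summary = {}
    for cid, obs in by_car.items():
        obs.sort()  # order observations by crawl date
        # Inclusive span between the first and last crawl that saw the car.
        days = (obs[-1][0] - obs[0][0]).days + 1
        # Count consecutive observations whose price differs.
        changes = sum(1 for (_, p0), (_, p1) in zip(obs, obs[1:]) if p0 != p1)
        summary[cid] = (days, changes)
    return summary


print(summarize(rows))
```

The observed span is only a lower bound on a listing's true lifetime, since a car may have been posted before the first crawl or removed after the last one.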
The crawled data covers 42,543 different cars from the following intervals (dates are in YYYY-MM-DD format):
Car Web source 2 (Download the dataset)
We crawled ten car models at this Web source (the same models as in Table 2). The schema of the collected data is the following:
Table 4: Schema of collected data
Field | Description
cid | Internal Web source database ID of the car entry
year | The year in which the car was built
price | The price of the car
make | Make of the car
model | Model of the car
crDate | The crawling date
The crawled data covers 46,311 different cars from the following intervals (dates are in YYYY-MM-DD format):
Job Web source 1 (Download the dataset)
We crawled seven different categories of jobs. We anonymized the job categories to keep the data confidential. The schema of the collected objects is the following:
Table 5: Schema of collected data
Field | Description
id | Internal Web source database ID of the job entry
cat | Job category
crDate | The crawling date
postDate | Posting date of the job (as shown on the Web source)
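Since each record carries both a posting date and a crawling date, one simple quantity to derive is how old a job was when the crawler first saw it. The sketch below illustrates this under stated assumptions: rows are already parsed into dictionaries keyed by the schema's field names, and both crDate and postDate are in YYYY-MM-DD format (the postDate format on the Web source may differ). The sample rows and the `age_at_first_crawl` helper are hypothetical.

```python
from datetime import date

# Hypothetical rows, invented for illustration only (not taken from the dataset).
rows = [
    {"id": "j1", "cat": "c1", "crDate": "2006-06-05", "postDate": "2006-06-01"},
    {"id": "j1", "cat": "c1", "crDate": "2006-06-06", "postDate": "2006-06-01"},
    {"id": "j2", "cat": "c2", "crDate": "2006-06-06", "postDate": "2006-06-06"},
]


def age_at_first_crawl(rows):
    """Return {job id: days between posting and the earliest crawl}."""
    first_seen = {}
    for r in rows:
        crawled = date.fromisoformat(r["crDate"])
        posted = date.fromisoformat(r["postDate"])
        prev = first_seen.get(r["id"])
        # Keep only the earliest crawl for each job.
        if prev is None or crawled < prev[0]:
            first_seen[r["id"]] = (crawled, posted)
    return {jid: (c - p).days for jid, (c, p) in first_seen.items()}


print(age_at_first_crawl(rows))
```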
The collected data includes 31,157 jobs and spans the following intervals (dates are in YYYY-MM-DD format):
Job Web source 2 (Download the dataset)
Jobs in the management and technology categories were crawled from this Web source. The schema of the data is the following:
Table 6: Schema of collected data
Field | Description
postDate | Posting date of the job (as shown on the Web source)
id | Internal Web source database ID of the job entry (job ID)
cat | Job category
crDate | The crawling date
The collected data contains 2,500 jobs and spans the following intervals (dates are in YYYY-MM-DD format):
Book Web source (Download the dataset)
Books on three subjects were crawled from this Web source: Java, Linux, and DBMS (database management systems). The schema of the data is the following:
Table 7: Schema of collected data
Field | Description
id | Internal Web source database ID of the book entry
price | Price of the book
topic | The topic of the book (subject of the book)
crDate | The crawling date
The collected data contains 3,315 books and spans the following intervals (dates are in YYYY-MM-DD format):
Forums Web source (Download the dataset)
Four different kinds of posts were crawled. The post categories are anonymized to keep the data confidential. The schema of the collected objects is the following:
Table 8: Schema of collected data
Field | Description
id | Internal Web source database ID of the post entry
cat | Post category
crDate | The crawling date
replies | Number of replies to the post
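The replies field gives a direct measure of how a post changes between crawls. As a rough sketch, assuming rows parsed into dictionaries keyed by the schema's field names and crDate in YYYY-MM-DD format, the change in reply count between consecutive crawls of the same post could be computed as follows. The sample rows and the `reply_deltas` helper are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical rows, invented for illustration only (not taken from the dataset).
rows = [
    {"id": "p1", "cat": "c1", "crDate": "2006-12-01", "replies": 2},
    {"id": "p1", "cat": "c1", "crDate": "2006-12-02", "replies": 5},
    {"id": "p1", "cat": "c1", "crDate": "2006-12-03", "replies": 5},
    {"id": "p2", "cat": "c2", "crDate": "2006-12-02", "replies": 0},
]


def reply_deltas(rows):
    """Return {post id: [change in replies between consecutive crawls]}."""
    by_post = defaultdict(list)
    for r in rows:
        by_post[r["id"]].append((date.fromisoformat(r["crDate"]), r["replies"]))

    deltas = {}
    for pid, obs in by_post.items():
        obs.sort()  # order observations by crawl date
        deltas[pid] = [n1 - n0 for (_, n0), (_, n1) in zip(obs, obs[1:])]
    return deltas


print(reply_deltas(rows))
```

A post seen in only one crawl yields an empty list, and a run of zeros suggests the post has stopped changing.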
The collected data objects for this Web source contain 25,392 posts and span the following intervals (dates are in YYYY-MM-DD format):
This project is partially supported by NSF CAREER Award No. IIS-0238586 and a Smith Faculty Seed Fund of ICS at UCI.
If you have any questions about these datasets, please contact Houtan Shirani-Mehr (hshirani AT uci.edu) or Chen Li (chenli AT ics.uci.edu).