iVia Notes on Expert Guided Crawlers
I.) Expert Guided Crawlers
- Expert Guided Crawler:
- A single URL is given and a record is built for that URL
- Multiple URLs can be given and records are built only for those URLs specified
- Expert Guided Crawler + Drill Down/Out:
- A single URL is given and a record is built for that URL (with a user option NOT to build this record), plus...
- The crawler can be directed (through user settings) to drill down a specified number of levels within the site and to drill out a specified number of external jumps from the site. An example is a virtual library with 5 levels of directory pages within the site (often going from general to specific subjects), each of which has from ten to a few hundred external links. So, for example, you could set this crawler to: create a record (or not) for the URL given; capture URLs at the URL given (level 1); drill down 5 levels from this URL and extract URLs at each level; go out 2 external links from each URL chosen at each level; and, at each jump, apply the same drill down (5 levels). Recommended range of user choices: 1-20 levels of drill down (within the given site), plus an additional option of unhindered drill down for the occasional super site that needs it (e.g., Yahoo), and up to 2 levels of drill out (pursuing external links).
- Multiple URLs can be given, and both of the choices above (record building for the given URL, and drill down/out) remain operable for each one.
The above are their official names now.
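The drill down/out settings described above can be sketched as a small breadth-first planner. This is only an illustration under stated assumptions, not the iVia implementation: the names `CrawlSettings`, `site_of`, and `plan_crawl` are hypothetical, the `links` dictionary stands in for real page fetching and link extraction, "same site" is approximated by comparing host names, the given URL is counted as level 1, and the level counter restarts at 1 after each external jump.

```python
from collections import deque
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class CrawlSettings:
    """Hypothetical user settings for the Drill Down/Out crawler."""
    drill_down: int = 5              # deepest level visited within a site (given URL = level 1)
    drill_out: int = 2               # maximum number of external jumps from the starting site
    build_start_record: bool = True  # user option to NOT build a record for the given URL


def site_of(url):
    """Assumption: two URLs belong to the same site when their host names match."""
    return urlparse(url).netloc.lower()


def plan_crawl(start_urls, links, settings):
    """Return the URLs selected for record building, in breadth-first order.

    `links` maps a URL to the URLs found on that page; it is a stand-in
    for fetching the page and extracting its links.
    """
    starts = list(dict.fromkeys(start_urls))      # de-duplicate, keep order
    queue = deque((url, 1, 0) for url in starts)  # (url, level in site, jumps used)
    seen = set(starts)
    selected = []
    while queue:
        url, level, jumps = queue.popleft()
        if settings.build_start_record or url not in starts:
            selected.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue
            if site_of(target) == site_of(url):
                # Internal link: one level deeper within the same site.
                entry = (target, level + 1, jumps)
                allowed = level + 1 <= settings.drill_down
            else:
                # External link: one more jump, and the drill down restarts.
                entry = (target, 1, jumps + 1)
                allowed = jumps + 1 <= settings.drill_out
            if allowed:
                seen.add(target)
                queue.append(entry)
    return selected
```

Passing several start URLs covers the multiple-URL case, and setting `drill_down=1, drill_out=0` reduces the planner to the plain Expert Guided Crawler, which builds records only for the URLs given.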
II.) Modes That We May Want to Design In:
There are many applications for the following possible modes of Expert Guided Crawling:
We can have these crawlers be interactive.
We could combine them with rich text/aboutness seeking.
Determinant Mode:
e.g., hierarchical vl, i.e., virtual library (drill down all levels and go out 1 link)
e.g., preprint collections (drill down 1 level and go out 0 levels); process the whole paper (no aboutness)
In Process Interactive Determinant Mode:
e.g., hierarchical vl (drill down all levels and go out 1 link), then return ti/URL results list for manual review with NO options to eliminate bogus sites before record building
Explore Mode:
e.g., hierarchical vl (drill down all levels and out 3 links): explore mode (rich text/aboutness mode ON)
e.g., preprint collections (drill down 1 level and go out 0 levels), but use explore mode (rich text/aboutness mode ON) to identify and work with "abstracts" and "conclusions" only
In Process Interactive Explore Mode:
e.g., hierarchical vl (drill down all levels and out 3 links): explore mode (rich text/aboutness mode ON), then return ti/URL results list for manual review with NO options to eliminate bogus sites before record building
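One possible way to encode the four modes is as two independent switches (aboutness ON/OFF, manual review ON/OFF) combined with the drill settings of a given job. The sketch below is an assumption about how this might be modeled, not a design the notes commit to: `CrawlMode`, `CrawlJob`, `MODES`, `EXAMPLES`, and `next_step` are hypothetical names, and "drill down all levels" is represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlMode:
    name: str
    aboutness: bool      # rich text/aboutness seeking ON or OFF
    manual_review: bool  # return a ti/URL results list before record building


# The four modes named in the notes, keyed by name.
MODES = {
    mode.name: mode
    for mode in (
        CrawlMode("determinant", aboutness=False, manual_review=False),
        CrawlMode("in-process interactive determinant", aboutness=False, manual_review=True),
        CrawlMode("explore", aboutness=True, manual_review=False),
        CrawlMode("in-process interactive explore", aboutness=True, manual_review=True),
    )
}


@dataclass(frozen=True)
class CrawlJob:
    mode: CrawlMode
    drill_down: Optional[int]  # None means "drill down all levels"
    drill_out: int             # number of external links to go out


# Examples taken from the notes above.
EXAMPLES = {
    "hierarchical virtual library, determinant": CrawlJob(MODES["determinant"], None, 1),
    "preprint collection, determinant": CrawlJob(MODES["determinant"], 1, 0),
    "hierarchical virtual library, explore": CrawlJob(MODES["explore"], None, 3),
    "hierarchical virtual library, interactive explore": CrawlJob(
        MODES["in-process interactive explore"], None, 3
    ),
}


def next_step(job):
    """Say what happens once the crawl for this job has finished."""
    if job.mode.manual_review:
        return "manual review of ti/URL results list"
    return "record building"
```

Keeping the two switches separate from the drill settings means a new example (say, a different drill out depth) needs only a new `CrawlJob`, not a new mode.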
III.) Input/Output:
- Ingest Processes:
- Mass Ingest: The above assumes that there are ways for mass ingestion of URLs. E.g., we're given a list of several thousand URLs.
- Single URL Expert Suggested: An expert pastes in one URL.
- Multiple URLs Expert Suggested: An expert pastes in a dozen or two URLs.
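All three ingest processes could share one front end that takes text with one URL per line, whether that text is a single pasted URL, a dozen pasted URLs, or a file of several thousand. The following is a minimal sketch under that assumption; `ingest_urls` is a hypothetical name, and the acceptance rule (http or https scheme with a host) is a guess at what a reasonable first filter might be.

```python
from urllib.parse import urlparse


def ingest_urls(text):
    """Split pasted or file text into (accepted, rejected) URL lists.

    Assumes one candidate URL per line. Blank lines are skipped,
    duplicates are dropped, and input order is preserved.
    """
    accepted, rejected, seen = [], [], set()
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue
        parts = urlparse(candidate)
        if parts.scheme in ("http", "https") and parts.netloc:
            if candidate not in seen:
                seen.add(candidate)
                accepted.append(candidate)
        else:
            # Kept for the expert to inspect rather than silently dropped.
            rejected.append(candidate)
    return accepted, rejected
```

For mass ingest the same function can be fed the contents of a file, e.g. `ingest_urls(open(path).read())`, where `path` is whatever list the crawler is given.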
- Output Processes:
NSDL OAI
We will also need MARC21 and standard delimited format
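Of the output formats listed, the delimited one is the simplest to illustrate. The sketch below writes tab-delimited text with Python's standard `csv` module. The field set (`title`, `url`) and the function name `to_delimited` are assumptions for illustration only; the notes do not specify which record fields are exported, and the OAI and MARC21 outputs are not attempted here.

```python
import csv
import io


def to_delimited(records, delimiter="\t"):
    """Serialize records (dicts with 'title' and 'url') as delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(["title", "url"])  # header row; hypothetical field names
    for record in records:
        writer.writerow([record["title"], record["url"]])
    return buffer.getvalue()


# Placeholder record; the title is illustrative, not harvested data.
sample = [{"title": "Example title", "url": "http://www.monarchwatch.org/"}]
print(to_delimited(sample), end="")
```

Using `csv.writer` rather than joining strings by hand means a title that happens to contain the delimiter is quoted automatically.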
IV.) Targets:
First Target:
http://www.indiana.edu/~cheminfo/
Collections as Lists
http://www.newscientist.com/weblinks/
http://www.nap.edu/csarc.html
http://pests.ifas.ufl.edu/bestbugs/
http://www.100topsciencesites.com/
http://www.education-world.com/science/
http://www.peebles.scoca-k12.org/links/science.htm
http://scorescience.humboldt.k12.ca.us/fast/kids.htm
http://faculty.washington.edu/chudler/neurok.html
Collections offering some kind of Search Engine
http://www.sciencemag.org/netwatch/
http://www.monarchwatch.org/
http://pages.britishlibrary.net/charles.darwin/
http://www.bbc.co.uk/webguide/science/index.shtml
http://agrifor.ac.uk/