[ExI] Backing up the Cloud
emlynoregan at gmail.com
Thu Oct 2 01:06:16 UTC 2008
2008/10/2 Bryan Bishop <kanzure at gmail.com>:
> On Wed, Oct 1, 2008 at 12:43 AM, Emlyn <emlynoregan at gmail.com> wrote:
>> Anyway, I think it's an idea that could work, commercially even.
> There's already a few sites that allow you to interface with a lot of
> different websites and steal their information. They are generally
> frowned upon by the websites that are being crawled of course, but
Yes, that kind of spidering has been around for a while.
> There's generally two ways of doing this. You could have a
> backend crawler on your own servers for each of your users, thus most
> of your requests either coming from stolen randomized IP addresses or
> your own, the latter of which will become quickly blocked;
This is probably a mostly-fail option; either you have to be really
dodgy, or you'll just get blocked. In reality you would probably need
to negotiate with each site you wanted to talk to and get them to
agree to allow your service as enhancing theirs. Could be tricky.
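If you did negotiate access, your side of the bargain might look
something like a throttled, honestly-identified fetcher. A minimal
sketch (the class name, the interval, and the User-Agent string are
all invented for illustration; a real agreement would specify these):

```python
import time
import urllib.request

class PoliteFetcher:
    """Fetch pages from a partner site at an agreed rate, identifying
    ourselves honestly instead of hiding behind rotating IPs."""

    def __init__(self, min_interval=5.0,
                 user_agent="CloudBackupBot/0.1 (contact: you@example.com)"):
        self.min_interval = min_interval  # agreed seconds between requests
        self.user_agent = user_agent      # honest self-identification
        self._last_request = 0.0

    def seconds_to_wait(self, now):
        # How long until the next request is allowed under the agreement.
        return max(0.0, self.min_interval - (now - self._last_request))

    def fetch(self, url):
        # Sleep out the remainder of the interval, then fetch.
        time.sleep(self.seconds_to_wait(time.time()))
        req = urllib.request.Request(url,
                                     headers={"User-Agent": self.user_agent})
        self._last_request = time.time()
        with urllib.request.urlopen(req) as resp:
            return resp.read()
```

The point is that the rate limit and the User-Agent are part of the
deal: the partner site can meter and whitelist you rather than block you.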
> the other
> option is to go all out with the "web 2.0" nonsense and do some fancy
> userscripts and firefox extensions (and the like) to automatically
> pull data that the user comes across in his daily browsing, or all at
> once if necessary. This way, the content can be pulled only once [[not
> that these services are lacking bandwidth (okay, except twitter)]].
Yes, that could work; have automated stuff on the user's machine do
it as them. Much more error-prone, though, you'd think.
Note that you probably don't want to beat these services to death; I'm
thinking occasional (e.g. weekly?) full backups at best per user, or
manually initiated transfer processes to move stuff from one service
to another. Maybe it should always be manually initiated?
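The weekly-at-best throttle is simple to express. A minimal sketch
(the function and the timestamp map are made up for illustration;
manual transfers would bypass this check entirely):

```python
WEEK = 7 * 24 * 60 * 60  # one week, in seconds

def users_due(last_backups, now, interval=WEEK):
    """Pick out which users' automatic full backups are due.

    last_backups maps user id -> timestamp of that user's last full
    backup. Only users whose last backup is at least `interval` old
    get swept up, so the target services aren't hammered."""
    return [u for u, ts in sorted(last_backups.items())
            if now - ts >= interval]
```

Running this sweep once a day would keep each user on roughly a
weekly cadence without any per-user scheduling machinery.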
> Anyway, it's being done.
Piecemeal, if at all, as far as I can see. Note that the whole class
of techniques for copying stuff from an internet based service to a
file on your local machine is really not what I'm thinking of;
although you could provide that service, you'd be aiming at a
cloud-based system; the user's machine should really be treated as a
window onto that.
> I'd recommend looking into the twitter and
> facebook CLI packages, and maybe prod me to go hunt down those links
> to some services doing this. The name of them escape me because I
> cared so little at the time.
I did a little googling and didn't come up with anything polished.
Note that the fact that the technologies exist to do this is a good
thing! With something like this, you would ideally want to start in
the knowledge that the guts of your shiny app were going to be based
on well understood techniques and/or pre-existing code.
> Please also consider the reverse
> direction; I have many hundreds of gigabytes, perhaps terabytes, of
> archived and personal data all locally stored. Yes, I can and do
> backup more redundantly than just locally, but it would be interesting
> to consider the reverse direction, i.e. how to publish multiple
> gigabytes from my own sources. In the case of "web 2.0" website from
> one to the other, there's a few synchronization and following
> services, but that's already preformatted data ready to be slurped up
> by those socially-inclined-websites, not just raw HTML, PDF, etc. that
> one might have laying around and of relevance to various social groups
> connected over the servers you're talking about, Emlyn.
I imagine that's a task fairly specific to you, based on the kind of
stuff you have that you want to publish. But you'd have to be more
specific about your intent. Raw HTML and PDF can just go online as-is,
but that raises the question of how anyone would find anything in
there. Maybe you are thinking of putting it online in a searchable
way? Why not just make it as google friendly as possible and wait for
it to be indexed?
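One cheap way to be google friendly with a static pile of HTML and
PDF is to publish a sitemap alongside it. A minimal sketch (the
function name and URLs are invented for illustration):

```python
from xml.sax.saxutils import escape

def make_sitemap(base_url, paths):
    """Build a minimal sitemap.xml listing every published file, so a
    search crawler can find the lot without guessing at links."""
    base = base_url.rstrip("/")
    entries = "\n".join(
        "  <url><loc>%s</loc></url>" % escape(base + "/" + p.lstrip("/"))
        for p in paths
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + entries +
            '\n</urlset>')
```

Drop the result at the site root as sitemap.xml and the files become
discoverable even if nothing links to them.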
http://emlynoregan.com - my home
http://point7.wordpress.com - downshifting and ranting
http://speakingoffreedom.blogspot.com - video link feed of great talks
More information about the extropy-chat mailing list