I'm trying to automatically find connections between accounts on different networks (social networks, content hosting sites, and others) that are held by the same Agent. I'm starting with YouTube, because that's a good source of content creators.
Who?
I haven't figured out a way to reliably pick channels at random (and have since decided that wouldn't be a good way of doing it anyway, due to the long tail of people who don't upload anything at all, let alone are 'active' content creators), so I'm starting with the 'standard feeds'. These used to, more sensibly, be called Charts. They're RSS (or Atom or JSON) feeds of statistics about content or channels. They no longer appear on the frontend of the site, but are available if you know where to look. They are mentioned in some of the API documentation, which is referred to in the YouTube Help about generating your own RSS feeds from YouTube content. The standard feeds are:
Most of these are useful in finding popular videos, which means there's a good chance the uploader has a wide network of connections within YouTube (which I can follow to get more information). Many, though, will be one-hit wonders. I've picked Top favorites as a list that intuition says is more likely to be populated by videos from channels to which viewers have some kind of loyalty. These days everything you do on YouTube shows up in your friends' feeds, so people may favourite a video as part of building their own identity on the service, as well as to support the content creators they love. It demonstrates an active, positive reaction to the video. It's the content creators whose content is received in this way that I'm ultimately aiming to support. Most viewed, discussed, linked and responded could simply be controversial. Recently featured is some YouTube-inner-circle conspiracy, no doubt. This is all my opinion; if anyone has any better insights on these charts, please do let me know.
I'm also limiting the charts to 'this week', to get a fairly - but not too - rapid turnover of data. 'Today' might give me too many less-established, one-hit-wonder, viral-of-the-moment types; a longer term establishes some sort of consistent enjoyment of the video by the masses. 'All time' is a fairly unchanging list, and would mean all my research is based around Charlie Bit Me. (Although this in itself might be an interesting study of content creator evolution: the original video was aimed at close family and friends, went viral by chance, and since then the parents and children involved have built a many-$ content creation empire, with sequels and merchandise and all sorts. They've easily made enough from ad revenue to put both kids through college. But that's another discussion.)
Why though?
I'd like to know which other networks are most commonly linked to by active content creators. This might indicate what kinds of interactions are meaningful to them. Social networks for interacting with fans? Other content host sites for different versions of their content, or different media types? Independently run websites and portfolios? Online merchandise stores? Other people's content they want to share with their viewers (friends and collaborators)?
It might also be interesting to try to find out how often people reuse the same username across sites. And do people link to profiles on other sites that aren't their own? Either profiles they share with collaborators or friends, or just other people's profiles entirely? How can I reliably differentiate?
YouTube's provisions for external account linking
YouTube allows people to put links on their channel. They can choose up to four 'social' links to display icons for over their channel banner, plus one 'custom' link. They can also input as many custom links as they like which show up in a list in the About section of their channel.
The predefined list of 'social' links from YouTube is:
- Google+
- Facebook
- Twitter
- Myspace
- Tumblr
- Blogger
- deviantArt
- WordPress
- SoundCloud
- Orkut
- Flickr
- Google Play
- iTunes
- Pinterest
- Instagram
- Zazzle
- CafePress
- Spreadshirt
- LinkedIn
There are crucial things missing from this list, I'm sure - Bandcamp, Newgrounds, off the top of my head - but if this is what YouTube thinks its users want to connect to, then it seems like as good a place to start as any. And of course, if a chosen profile doesn't appear on this list, they can add it (labelled however they want) in the custom links section. The custom links section is also often used for listing secondary (or tertiary or group) YouTube channels, which are fairly commonly found amongst active YouTubers.
Getting these links programmatically
The YouTube API (v3) is balls when it comes to giving me information that is useful in this regard.
Scraping time!
Code is on Bitbucket.
Currently all of these links, regardless of banner, social, or custom, conveniently reside in <li>s with a class of custom-links-item. I BeautifulSouped them out. (Why I can't get this information through the API, I don't know.)
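As a rough illustration of that extraction step, here is a minimal sketch. It is not the code on Bitbucket: it swaps BeautifulSoup for the standard library's html.parser so that it runs with no dependencies, and it assumes each matching <li> wraps an <a> whose href is the link (the sample markup is made up). With BeautifulSoup the equivalent lookup would be along the lines of soup.find_all("li", class_="custom-links-item").

```python
from html.parser import HTMLParser


class CustomLinkParser(HTMLParser):
    """Collect hrefs of <a> tags found inside <li class="custom-links-item">."""

    def __init__(self):
        super().__init__()
        self.li_depth = 0  # > 0 while inside a matching <li>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = (attrs.get("class") or "").split()
            # Enter a matching <li>, or track an <li> nested inside one.
            if self.li_depth or "custom-links-item" in classes:
                self.li_depth += 1
        elif tag == "a" and self.li_depth and attrs.get("href"):
            # Assumption: the link target lives in an <a href="..."> child.
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "li" and self.li_depth:
            self.li_depth -= 1


def extract_custom_links(html):
    """Return the hrefs from every custom-links-item list item, in page order."""
    parser = CustomLinkParser()
    parser.feed(html)
    return parser.links


# Made-up markup standing in for a channel page.
SAMPLE = """
<ul>
  <li class="custom-links-item"><a href="https://twitter.com/example">Twitter</a></li>
  <li class="something-else"><a href="https://example.com/ignored">Ignored</a></li>
  <li class="about-item custom-links-item"><a href="http://example.org/">Site</a></li>
</ul>
"""

links = extract_custom_links(SAMPLE)
print(links)
```

Because this depends on the current page markup, it will break whenever YouTube renames that class.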
Linked Data-ing things
So I'll use FOAF's OnlineAccount to hook all the accounts together as Linked Data, which in theory is a perfect fit. SIOC's UserAccount is also an option, but I'll keep it simple for now.
In related news, YouTube is phasing out usernames. New YouTube channels are now created directly through Google+, with a Google+ ID as the unique identifier. It's trying (to the outrage of YouTubers with any kind of branding or well-known identity) to encourage people to hook up their channels to their G+ profiles, and lose their old username. Once done, this cannot be undone. I'd still expect to be able to find out someone's username if they have one though, given the unique channel ID. The API doesn't return this. You get a channel 'title', which is just a display name. For some people (those with established branding) this will be their ye olde username, but for many - most, I suspect - it's their G+ (supposedly real) name.
It just means that for YouTube channels I have to use the gibberish long unique ID instead of a nice human readable username for the foaf:accountName. This goes against what I feel accountName means, but is compliant with the spec, so I guess I'll leave it there.
Everything else at that point is straightforward:
Once I have the links, and have broken them down into their constituent parts with urlparse, I can use rdflib to turn them into, e.g.:
And store them somewhere ... to be continued.
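To make the urlparse step concrete, here is a minimal sketch under some loud assumptions. It uses Python 3's urllib.parse (the urlparse module's current home), it writes N-Triples strings by hand rather than using rdflib so that it runs without dependencies, and it guesses that the account name is the last path segment of the profile URL. The holder URI and the profile URLs are invented for illustration.

```python
from urllib.parse import urlparse

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def account_parts(url):
    """Split a profile URL into (service homepage, account name)."""
    parsed = urlparse(url)
    homepage = f"{parsed.scheme}://{parsed.netloc}/"
    # Assumption: the account name is the last non-empty path segment.
    # For a YouTube /channel/<id> URL this yields the long channel ID.
    segments = [s for s in parsed.path.split("/") if s]
    name = segments[-1] if segments else parsed.netloc
    return homepage, name


def account_triples(holder_uri, account_url):
    """Return N-Triples lines describing one foaf:OnlineAccount."""
    homepage, name = account_parts(account_url)
    # Minimal literal escaping; a real serializer (e.g. rdflib) handles more.
    literal = name.replace("\\", "\\\\").replace('"', '\\"')
    acct = f"<{account_url}>"
    return [
        f"<{holder_uri}> <{FOAF}account> {acct} .",
        f"{acct} <{RDF_TYPE}> <{FOAF}OnlineAccount> .",
        f"{acct} <{FOAF}accountServiceHomepage> <{homepage}> .",
        f'{acct} <{FOAF}accountName> "{literal}" .',
    ]


# Hypothetical holder URI and profile URL.
triples = account_triples("http://example.org/persona/1",
                          "https://twitter.com/example")
print("\n".join(triples))
```

The last-path-segment rule is only a heuristic: it will misfire on services that put the username in a subdomain or a query string, so each service probably needs its own rule eventually.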
OnlinePersonae
I'll probably subclass Agent with OnlinePersona (inspired by K. Faith Lawrence's FanOnlinePersona) and have the accounts belonging to that. Eventually OnlinePersona will have more properties which it won't necessarily share with all Agents.
Note: SIOC doesn't have a notion of this type. SIOC has UserAccount which subclasses foaf:OnlineAccount, and thus defers back to a foaf:Agent as the account holder.
Sooo... what do I use as URIs for my OnlinePersonas?
This merits a tangent in the discussion, so I'll make another post about URI issues.
URI locations
Months ago (probably) I thought it would be a good idea to make a PURL for all of my content creation ontology related stuff. I couldn't find any existing sensibly named domains that are public at purl.org... things like '/ontology' are selfishly private. So I created '/content-creation' as a (public!) top-level domain. It's still 'pending approval'. Which means I can't do anything with it. Is purl.org even looked after any more? Grumble.
(Andrei Sambra suggested I use prefix.cc to give my ontology a pretty name. Which looked briefly promising, before I realised it doesn't redirect automatically to an ontology... it's good for humans searching for vocab prefixes, but not for machines by any stretch. Mo validated my feeling that ontology URIs ought to resolve to machine- and human-readable descriptions.)
I had been going to use data.inf.ed.ac.uk as the base, but the server that pointed to melted down last month. I dunno when it'll be back. So I'll stick to something I, personally, control. At some point I might buy a more suitable domain specifically for it, but I should discuss the options with some people who know what I'm doing before making a decision by myself. Available candidates right now though include: creativecontent.info, webcontentdb.com/info, internetcontentdb.com/info.
Oh, I just found out that purl.org isn't unfailingly reliable. In that case, forget it.
So for now I'll use:
- rhiaro.co.uk/vocab/oocc# for the ontology spec for any terms of my own (when I write it)
- rhiaro.co.uk/cc/onlinepersona/ for OnlinePersonas
- rhiaro.co.uk/cc/content/ for content, when I get that far.
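For illustration only, here is one way URIs could be minted under those bases. The http:// scheme and the slug rule (lowercase, non-alphanumerics collapsed to hyphens) are my assumptions, not a decided scheme, and the OnlinePersona term name just mirrors the class discussed above.

```python
import re

# Bases from the list above; the http:// scheme is an assumption.
VOCAB = "http://rhiaro.co.uk/vocab/oocc#"
PERSONA_BASE = "http://rhiaro.co.uk/cc/onlinepersona/"
CONTENT_BASE = "http://rhiaro.co.uk/cc/content/"


def slugify(label):
    """Hypothetical slug rule: lowercase, runs of non-alphanumerics become '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot build a slug from {label!r}")
    return slug


def persona_uri(label):
    """Mint a URI for an OnlinePersona from a human-readable label."""
    return PERSONA_BASE + slugify(label)


def vocab_term(name):
    """Build the URI of a term in the (yet to be written) ontology."""
    return VOCAB + name


print(persona_uri("Example Creator!"))
print(vocab_term("OnlinePersona"))
```

A label-based slug is readable but can collide when two personae share a display name, so an opaque identifier may turn out to be the safer key.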
Next
- Follow the links to find more connections and/or verify ones I've already found. For common social and content sites, I can manually scrape useful information or use their APIs. For independent websites or things I haven't come across before, I shall devise some means to not ignore them altogether...
- Grab other stuff from the YouTube profile and handle it in the same way. Featured channels may link to other channels the content creator is involved with. Subscriptions and mutual friends may be a good place to go for building up the network.
- Put more into the graph than just the FOAF OnlineAccounts. Start on content.