Analysis 2009: Semantics continues to not be RDF, but enrichment, classification and taxonomy

http://broadcast.oreilly.com/2009/01/analysis-2009-semantics-contin.html

By Kurt Cagle January 6, 2009

Within the realm of computational semantics, there is still a fairly broad disconnect between triple-based semantics, the use of RDF (or Turtle notation) to create atomic assertions, and the realm of semantics as reflected on the web. I do not expect this to change much in 2009, save perhaps that the gulf between the two will likely just get wider.

While I think that RDF (or more likely a successor to that set of specifications) will eventually go on to become the overall semantic tier of the web, it's rather depressing just how far we really are from RDF actually becoming widely adopted. Instead, the approaches themselves are still running largely into the proprietary realm, though there are a few interesting areas that should be watched fairly closely.

One of the open source projects that received a fair amount of buzz at OSCON 2008 was Freebase (http://www.freebase.com), a site which mines Wikipedia in order to establish an extraordinarily comprehensive RDF database about topics gleaned from the linkages found within Wikipedia. One benefit of this is that you can run queries such as "List all movies containing aliens" and, with the tools at hand, get back the content that matches. This in turn makes it possible to create relational queries on Wikipedia data, making it particularly useful as a research and data mining tool.
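To make the idea of querying atomic assertions concrete, here is a minimal sketch in plain Python. It is not Freebase's data model or query API; the predicate names (`type`, `features`) and the helper functions are invented for illustration. Each fact is a (subject, predicate, object) triple, and the "movies containing aliens" question becomes a join of two simple patterns.

```python
# Toy triples. The predicate names are assumptions made for this sketch.
TRIPLES = [
    ("Alien", "type", "Movie"),
    ("Alien", "features", "aliens"),
    ("E.T.", "type", "Movie"),
    ("E.T.", "features", "aliens"),
    ("Casablanca", "type", "Movie"),
    ("Casablanca", "features", "romance"),
    ("The War of the Worlds", "type", "Book"),
    ("The War of the Worlds", "features", "aliens"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def movies_featuring(triples, theme):
    """Join two patterns: ?x type Movie AND ?x features <theme>."""
    movies = {s for (s, _, _) in match(triples, predicate="type", obj="Movie")}
    themed = {s for (s, _, _) in match(triples, predicate="features", obj=theme)}
    return sorted(movies & themed)


print(movies_featuring(TRIPLES, "aliens"))  # ['Alien', 'E.T.']
```

The book is excluded because only one of the two patterns matches it, which is the kind of relational filtering that plain keyword search cannot express.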

My sense is that you'll see more of these types of applications showing up in 2009, apps built around the power of RDF (and increasingly SPARQL, the RDF/OWL query language). However, it's also likely that few, if any, of these sites will tout their Semantic Web credentials (or even acknowledge that this is what is going on under the hood).

One area that I feel is poised to really take off in the next year is content enrichment. Enrichment involves taking a collection of text, running a series of rules and contextual filters on the data looking for names, events and patterns, then encasing this content within specialized XML markup. Depending upon the database, the source, and the service agreements involved, such enrichment performs an invaluable service in being able to establish the context of a given phrase within an article, and by extension being able to provide both an abstract of the content and specialized search looking for meta-content within a document.

For instance, an article about Barack Obama and John McCain could be abstracted as being about the presidential contest, while an article about Barack Obama and George Bush might talk about transitions of power from one president to the next, with specific terms for each of these people (and related people determined by this context, highlighted as tagged content).
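A rough sketch of what such an enrichment pass might look like, using only the Python standard library. Everything here is an assumption made for illustration: the `<person>` and `<article>` element names, the hard-coded name list, and the two topic rules are not taken from any real enrichment service, which would rely on far richer rules and contextual filters.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical entity list for this sketch.
KNOWN_PEOPLE = ["Barack Obama", "John McCain", "George Bush"]

# Hypothetical context rules: people who co-occur -> abstract topic.
TOPIC_RULES = [
    ({"Barack Obama", "John McCain"}, "presidential contest"),
    ({"Barack Obama", "George Bush"}, "presidential transition"),
]

NAME_PATTERN = re.compile("|".join(re.escape(name) for name in KNOWN_PEOPLE))


def enrich(text):
    """Wrap known names in <person> tags and infer a topic from who appears."""
    safe = escape(text)  # escape &, <, > before adding our own markup
    found = set(NAME_PATTERN.findall(safe))
    # The first rule whose people all appear in the text decides the topic.
    topic = next(
        (label for people, label in TOPIC_RULES if people <= found),
        "unclassified",
    )
    marked = NAME_PATTERN.sub(lambda m: f"<person>{m.group(0)}</person>", safe)
    return f"<article topic={quoteattr(topic)}>{marked}</article>"


print(enrich("Barack Obama debated John McCain."))
print(enrich("George Bush met Barack Obama at the White House."))
```

The same names yield different abstracts depending on who else appears, which is the contextual part of enrichment; the tagged spans are what later make specialized search over the meta-content possible.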

Again, this is a service that both commercial and open source XML databases and content management systems are beginning to provide to their customers. It also illustrates what is increasingly becoming the norm in business applications: situations where critical processing of data streams is applied through web services by third party providers.

This is an area that is ripe for standardization. I suspect that the RDF crowd will probably be jumping up and down at this stage screaming "Use RDF! Use RDF!!" but I'm not really sure that will end up happening, at least not directly. I wrote last year about CURIEs and RDFa, which is an attribute-carried RDF descriptor language for text content, and with the specification now made into a full Recommendation (as of October, 2008), I suspect that it may start making its way in as an alternative offered format by many vendors, which raises the very real possibility that it could become the de facto standard for enrichment (or form the foundation for same) by late 2010.
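For readers who have not met CURIEs, the following sketch shows the basic idea under simplifying assumptions: a CURIE such as `dc:creator` is a prefix plus a reference, and it expands to a full URI by looking the prefix up in a table. The helper names (`expand_curie`, `rdfa_span`) are hypothetical, and the sketch ignores the finer points of the CURIE and RDFa specifications (default prefixes, safe CURIEs, and so on). The two namespace URIs are the usual Dublin Core and FOAF ones.

```python
from xml.sax.saxutils import escape

# Prefix table, playing the role of xmlns:dc / xmlns:foaf declarations.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'prefix:reference' into a full URI using the prefix table."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"cannot expand CURIE: {curie!r}")
    return prefixes[prefix] + reference


def rdfa_span(prop, text):
    """Emit an RDFa-style span; the prefix must be declared in the host page."""
    expand_curie(prop)  # fail early if the prefix is unknown
    return f'<span property="{prop}">{escape(text)}</span>'


print(expand_curie("dc:creator"))
print(rdfa_span("dc:creator", "Kurt Cagle"))
```

The appeal for enrichment is that the assertion rides along as an attribute on markup that already wraps the text, so a vendor can offer it as an alternative output format without restructuring the document.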

My central problem with RDF is that it is a brilliant technology that tried to solve too big a problem too early on by establishing itself as a way of building "dynamic" ontologies. Most ontologies are ultimately dynamic, changing and shifting as the requirements for their use change, but at the same time such ontologies change relatively slowly over time.

This means that the benefit of specifying a complex RDF Schema on an ontology - which can be a major exercise in hair pulling - is typically only advantageous in the very long term for most ontologies, and that in general the flexibility offered by RDF in that regard is much like trying to build a skyscraper out of silly putty. It's possible to do so (maybe), but the drawbacks in the increased complexity of code (especially given that most people are still having trouble understanding the relatively simple syntax of XPath) make it a dubious proposition at best, except for highly interconnected information spaces with comparatively few constraints acting on them, such as Freebase.

What I see happening instead is that there should be fairly significant consolidation of specifications down to a few consortia standards in any given domain - such as XBRL in business reporting, HL7 in health care, S1000D in airline specifications and so forth. Even five years ago, most industries tended to have two or more distinct standards competing for adoption, but in the last year many of these dual standard industries have either settled on one or merged the two standards together. Thus I see 2009 being devoted towards application development around an industry's preferred vertical ... with opportunities especially for those who work in developing such standards in the first place.

In other words, it's very likely that in order for the RDF/Semantic Web approach to gain credence in these spaces, ontologists will have to start with these specific industry schemas and develop RDF-based tools that model them. Given that I see XML databases increasingly carrying the load in working with these schemas, this will also likely result, at some point in the not too distant future, in a need for a meeting of the minds between the XQuery working group and the SPARQL working group in order to develop a SPARQL analog that can be run in XQuery, probably as a set of optional modular extensions to the language. I don't know if this is on the agenda at the W3C yet, though if it's not, then it's likely we won't see significant traction there until 2011 at the earliest.

The other area where there's been something of a "small s" semantic revolution has been the growing awareness of the intimate link between web navigation and knowledge navigation among both web developers and semantics specialists. As web sites grow, they become more complex, deeper, and far more difficult to maintain in terms of their underlying structure.

Ultimately this comes down to a question of classification and partition of the topics within the site itself, and this in turn points to a potential semantic solution for managing large and topically interconnected content. The folksonomy "tagging" revolution (which I think is probably running out of steam) was a significant first step, but folksonomies are by their nature unstructured and poorly regulated.

I think this is going to be the year that a lot of both web design and web framework support is going to embrace semantic tools and concepts (the inclusion of RDF support within the taxonomy-heavy Drupal system is a good case in point).

