[okfn-discuss] What Do We Mean by Componentization (for Knowledge)?
Rufus Pollock
rufus.pollock at okfn.org
Tue May 1 11:53:33 UTC 2007
Also at:
<http://blog.okfn.org/2007/04/30/what-do-we-mean-by-componentization-for-knowledge/>
~rufus
## Background
Nearly a year ago I wrote a short essay entitled [The Four Principles of
(Open) Knowledge
Development](http://blog.okfn.org/2006/05/09/the-four-principles-of-open-knowledge-development/)
in which I proposed that the four key features of a successful
(open) knowledge development process were that it was:
1. Incremental
2. Decentralized
3. Collaborative
4. Componentized
As I emphasized at the time the most important feature -- and currently
least advanced -- was the last: Componentization. Since then I've had
the chance to discuss the issue further, most recently and extensively at
[Open Knowledge 1.0](http://www.okfn.org/okcon/) and this has prompted
me to re-evaluate and extend the ideas I put forward in the original essay.
## What Do We Mean By Componentization?
> Componentization is the process of **atomizing** (breaking down)
resources into separate reusable **packages** that can be easily recombined.
Componentization is the most important feature of (open) knowledge
development as well as the one which is, at present, least advanced. If
you look at the way software has evolved, it is now highly componentized
into packages/libraries. Doing this allows one to 'divide and conquer'
the organizational and conceptual problems of highly complex systems.
Even more importantly it allows for greatly increased levels of reuse.
The power and significance of componentization really comes home to one
when using a package manager (e.g. apt-get for debian) on a modern
operating system. A request to install a single given package can result
in the automatic discovery and installation of all packages on which
that one depends. The result may be a list of tens -- or even hundreds
-- of packages in a graphic demonstration of the way in which computer
programs have been broken down into interdependent components.
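The dependency-following behaviour described above can be sketched in a few lines. This is a toy model, not apt-get's actual algorithm, and the package names and dependency graph are entirely illustrative:

```python
# Toy dependency graph (all names are made up for illustration).
DEPENDS = {
    "photo-viewer": ["image-lib", "gui-toolkit"],
    "image-lib": ["compression-lib"],
    "gui-toolkit": ["compression-lib"],
    "compression-lib": [],
}

def resolve(package, install_order=None):
    """Return the packages to install, dependencies first,
    each listed only once -- the essence of what a package
    manager does when you request a single package."""
    if install_order is None:
        install_order = []
    for dep in DEPENDS[package]:
        resolve(dep, install_order)
    if package not in install_order:
        install_order.append(package)
    return install_order

print(resolve("photo-viewer"))
# -> ['compression-lib', 'image-lib', 'gui-toolkit', 'photo-viewer']
```

A request for one package ("photo-viewer") yields the full closure of its dependencies in installable order, which is exactly the "tens or hundreds of packages" effect seen with a real package manager.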
## Atomization
Atomization denotes the breaking down of a resource such as a piece of
software or a collection of data into smaller parts (though the word
atomic connotes irreducibility, it is never clear what the exact
irreducible, or optimal, size for a given part is). For example, a given
software application may be divided up into several components or
libraries. Atomization can happen on many levels.
At a very low level, when writing software, we break things down into
functions and classes, into different files (modules) and even group
together different files. Similarly when creating a dataset in a
database we divide things into columns, tables, and groups of
inter-related tables.
But such divisions are only visible to the members of that specific
project. Anyone else has to get the entire application or entire
database to use one particular part of it. Furthermore, anyone working
on any given part of the application or database needs to be aware of,
and interact with, anyone else working on it -- decentralization is
impossible or extremely limited.
Thus, atomization at such a low level is not what we are really
concerned with; what matters is atomization into **Packages**:
## Packaging
By packaging we mean the process by which a resource is made reusable by
the addition of an external interface. The package is therefore the
logical unit of distribution and reuse and it is only with packaging
that the full power of atomization's "divide and conquer" comes into
play -- without it there is still tight coupling between different parts
of a given set of resources.
Developing packages is a non-trivial exercise precisely because
developing good *stable* interfaces (usually in the form of a code or
knowledge API) is hard. One way to manage this need to provide stability
but still remain flexible in terms of future development is to employ
versioning. By versioning the package and providing 'releases' those who
reuse the packaged resource can use a specific (and stable) release
while development and changes are made in the 'trunk' and become
available in later releases. This practice of versioning and releasing
is already ubiquitous in software development -- so ubiquitous it is
practically taken for granted -- but is almost unknown in the area of
knowledge.
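The stability that versioned releases buy can be made concrete with a small sketch. Here a knowledge package (a dataset with a schema) is published in immutable numbered releases while development continues on the trunk; the release numbers, schemas, and row counts are all hypothetical:

```python
# Hypothetical versioned releases of a knowledge package: each release
# is a frozen snapshot that reusers can depend on.
releases = {
    "1.0": {"schema": ["id", "name"], "rows": 1000},
    "1.1": {"schema": ["id", "name", "date"], "rows": 1200},
}
# Meanwhile development continues on the trunk, changing the schema.
trunk = {"schema": ["id", "name", "date", "place"], "rows": 1250}

def get(version):
    """Fetch a specific, stable release of the package."""
    return releases[version]

stable = get("1.0")                     # a reuser pins release 1.0
assert stable["schema"] == ["id", "name"]  # unaffected by trunk changes
```

The reuser's code keeps working against release 1.0 regardless of how the trunk evolves; when ready, they upgrade deliberately to 1.1 and adapt to the new schema.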
## A Basic Example: A Photo Collection
Imagine we had a large store of photos, say more than 100k individual
pictures (~50GB of data at 500k per picture). Suppose that initially
this data is just sitting as a large set of files on disk somewhere.
Consider several possibilities for how we could make them available:
1. Bundle all the photos together (zip/tgz) and post them for download.
Comment: this is a very crude approach to componentization. There is
little atomization and the 'knowledge-API' is practically non-existent
(it consists solely of the filenames and directory structure).
2. In addition tag or categorize the photos and make this database
available as part of the download. Comment: By adding some structured
metadata we have started to develop a 'knowledge-API' for the
underlying resource that makes it more useful. One could now write a
screensaver program which showed photos from a particular category or
auto-import photos by their area.
3. In addition suppose the photos fall into several well-defined and
distinct classes (e.g. photos of animals, of buildings and of works of
art). Divide the photo collection into these three categories and make
each of them a separate download. Comment: An initial step in atomizing
the resource to make it more useful; after all, 50GB is rather a lot to
download for one photo.
4. In addition to dividing them up allow different people to maintain
the tags for different categories (one might imagine those knowledgeable
about animals are different from those knowledgeable about art).
Comment: Atomization assists the development of good knowledge-APIs (the
human mind is limited and divide and conquer helps us deal with the
complexity).
5. Standardize the ids for each photo (if this hasn't been done already)
and separate the tags/categories data from the underlying photo data.
This way multiple (independent) groups can provide tags/categorization
data for the photos. Comment: Repackaging -- along with the development
of a better knowledge-API for the basic resource -- allows a dramatic
decrease in the level of coupling and an increase in the scope for
development of complementary libraries (the tags). This in turn will
increase the utility to end users.
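Step 5 above -- stable ids plus independently maintained tag data -- can be sketched as follows. The photo ids, filenames, and tag packages here are all invented for illustration:

```python
# The base resource: photos keyed by standardized ids.
photos = {"p001": "lion.jpg", "p002": "cathedral.jpg", "p003": "mona-lisa.jpg"}

# Two tag packages maintained by independent groups, coupled to the
# photos only through the shared ids (the 'knowledge-API').
animal_tags = {"p001": ["animal", "lion"]}
art_tags = {"p003": ["art", "painting"]}

def photos_tagged(tag, *tag_packages):
    """Combine any number of independent tag packages and return the
    ids of photos carrying the given tag."""
    return sorted(
        pid
        for tags in tag_packages
        for pid, tag_list in tags.items()
        if tag in tag_list and pid in photos
    )

print(photos_tagged("art", animal_tags, art_tags))  # -> ['p003']
```

Because the coupling runs entirely through the ids, a third group could publish, say, location tags tomorrow without coordinating with either existing maintainer -- the decentralized development the essay is arguing for.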
## Conclusion
In the early days of software there was also little arms-length reuse
because there was little packaging. Hardware was so expensive, and so
limited, that it made sense for all software to be bespoke and little
effort to be put into building libraries or packages. Only gradually did
the modern complex, though still crude, system develop.
The same evolution can be expected for knowledge. At present knowledge
development displays very little componentization but as the underlying
pool of raw, 'unpackaged', information continues to increase there will
be increasing emphasis on componentization and the reuse it supports. (One
can conceptualize this as a question of interface vs. the content.
Currently 90% of effort goes into the content and 10% goes into the
interface. With components this will change to 90% on the interface and
10% on the content.)
The change to a componentized architecture will be complex but, once
achieved, will revolutionize the production and development of open
knowledge.
--
Executive Director, Open Knowledge Foundation
m: +44 (0)7795 176 976
www: http://www.okfn.org/ | blog: http://blog.okfn.org/