Monday, October 26, 2009

5. Extensibly Marked Up

I'm new to XML (eXtensible Markup Language), but as I embark on an information career I will need to be familiar with it. XML describes data and allows information to be moved from one place to another. It is used to organise information: for example in records management, where the Dublin Core metadata standard, maintained by the Dublin Core Metadata Initiative, helps ensure records can be shared or transferred, and in archiving, where the Encoded Archival Description (EAD) is the internationally recognised schema. XML is also used in bibliography and referencing packages such as EndNote and RefWorks, which enables resources to be transferred to other packages as well as to word processing software.

I produced this list of magazines using XML. It's a very basic structure, having only the title, subject and date of issue as the categories. To improve the search and make it more detailed, I could also add a list of contents and writers/contributors and where the magazine is located.

XML is like a family. It contains one root element, the parent, while the elements below the root are the children. In this example, the root element is the magazine list and the child elements are the title, subject and date.
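The parent and child relationship can be seen by reading a small document with Python's standard library. This is only a sketch: the element names, titles, subjects and dates are made up for illustration, and I have assumed each magazine is wrapped in its own magazine element, which may differ from my actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical magazine list; all names and values are invented.
MAGAZINES_XML = """<magazinelist>
  <magazine>
    <title>Example Science Monthly</title>
    <subject>Science</subject>
    <date>2009-10</date>
  </magazine>
  <magazine>
    <title>Example Gardening Weekly</title>
    <subject>Gardening</subject>
    <date>2009-09</date>
  </magazine>
</magazinelist>
"""

def list_magazines(xml_text):
    """Return a (title, subject, date) tuple for each magazine element."""
    root = ET.fromstring(xml_text)
    return [
        (m.findtext("title"), m.findtext("subject"), m.findtext("date"))
        for m in root.findall("magazine")
    ]

root = ET.fromstring(MAGAZINES_XML)
print(root.tag)  # the root (parent) element: magazinelist
for entry in list_magazines(MAGAZINES_XML):
    print(entry)
```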



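An XML document must be well-formed, otherwise a parser will reject it and it will not display properly. It can also be valid, which means that it follows the rules set out in a DTD or schema. Unlike HTML, every XML element must have both an opening and a closing tag (or be written as a self-closing tag), and the syntax has to be consistent: tag names are case-sensitive, for instance. Validity restricts the content further. For example, if the rules only allow "yes" or "no", you cannot use other words such as "maybe".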
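A parser can check the open and closed tags for you. Here is a minimal sketch using Python's standard xml.etree.ElementTree module, which checks well-formedness only and does not validate against a DTD. The function name is_well_formed and the sample strings are my own inventions.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Every opening tag has a matching closing tag.
print(is_well_formed("<magazine><title>Example</title></magazine>"))  # True
# The title element is never closed.
print(is_well_formed("<magazine><title>Example</magazine>"))          # False
# Tag names are case-sensitive, so Title and title do not match.
print(is_well_formed("<magazine><Title>Example</title></magazine>"))  # False
```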

A document type definition (DTD) sets out the rules that a valid XML document must follow:


Working with XML and DTD requires a really good eye for detail. DTDs are quite limited: they offer little control over data types and are written in their own syntax rather than in XML. Hence XML Schema, which is itself written in XML, is becoming increasingly popular.

Monday, October 19, 2009

4. Worth more than a thousand words

I'm put off by websites that are text-heavy. Images help to improve presentation and can also be a useful aid to people who don't have English as their first language. Information management systems such as bibliography packages tend not to have images other than a logo - functionality is important but personally I think images would make people want to use the systems.

The most commonly used formats for websites are GIF, JPEG and PNG:

GIF (Graphics Interchange Format) images record up to 256 colours and are lossless as they retain information when compressed. GIFs can be used to create animated files by storing a sequence of images (frames) in one file.

JPEG (Joint Photographic Experts Group) supports more than 16 million colours and is the common format for photographs. JPEGs are lossy as they lose information when compressed.

PNG (Portable Network Graphics) is similar to GIF in that it is lossless, but it is higher quality as it caters for 24-bit colour or more. However, standard PNG cannot be animated.
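The colour counts above follow from bit depth: a pixel stored in n bits can take 2 to the power n different values. A quick sketch of the arithmetic (the function name is my own):

```python
def colours_for_bit_depth(bits):
    """Number of distinct colours a pixel of the given bit depth can hold."""
    return 2 ** bits

print(colours_for_bit_depth(8))   # 256, the 8-bit palette limit
print(colours_for_bit_depth(24))  # 16777216, i.e. more than 16 million
```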

This link illustrates what happens to images when they are resized and resampled. Throughout this blog, I've used different image formats to show the differences.

When digitising records, you need to consider which format is most appropriate for each purpose. For example, for archiving and preservation, images would need to be at optimum resolution, whereas for the web they would be lower quality so that they load quickly. The National Archives recommends formats that support metadata, such as Tagged Image File Format (TIFF), when preserving images, as the metadata makes them easier to search.

Images can be added to websites by embedding a file that's saved on your drive (example 1). An image can also be linked from an online source (example 2). Be aware that if the file is moved or removed, the image will disappear from your website.

Also be mindful of copyright issues. There are websites where you can obtain copyright-free photographs, such as copyright-free.com and freeimages.co.uk.

Monday, October 12, 2009

3. WWW and the Net

I confess that, like a lot of people, I've used the terms Internet and World Wide Web in the wrong context. The Internet is the infrastructure that allows computers to communicate with each other, while the World Wide Web is a system of hyperlinked documents, or hypertext, that is accessed through the Internet.


How the World Wide Web works ©CERN

The Internet grew out of ARPANET, an information sharing network funded by the US military's research agency (ARPA, later renamed DARPA) in the 1960s, while the Web was developed at CERN, the European Organization for Nuclear Research, from 1989 into the early 1990s.

Each computer on the Net is identified by a number, its Internet Protocol (IP) address. You can find your IP number here. Because numbers are hard to remember, domain names are used instead, and the Domain Name System (DNS) translates a domain name into the matching IP address. The Uniform Resource Locator (URL) enables us to find documents and files on the Internet. The format is: protocol://servername/local file path. For example, my website URL is http://www.student.city.ac.uk/~abhp645/. The CERN website explains how the Web works.

This system was developed in the 1990s by Sir Tim Berners-Lee, who in October 2009 admitted the format didn't really need the two slashes // (see Daily Telegraph article).
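The three parts of the URL format can be pulled apart with Python's standard urllib.parse module. This is just a sketch using my own website address from above; the variable names are arbitrary.

```python
from urllib.parse import urlsplit

url = "http://www.student.city.ac.uk/~abhp645/"
parts = urlsplit(url)

print(parts.scheme)  # http (the protocol)
print(parts.netloc)  # www.student.city.ac.uk (the server name)
print(parts.path)    # /~abhp645/ (the local file path)
```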

When creating a webpage, you use marked up text. The markup describes the structure of the webpage and can also describe its style. The W3Schools website provides an easy to follow tutorial on HTML.



Here's a link to my website. I added hyperlinks to connect from one document to the main page. This links to the HTML exercise that I produced. If you view the page source, you will see the HTML markup. I used the University's Unix system to publish the document onto the Web so that it is now visible to a global audience.

Monday, October 5, 2009

2. Bits, Bytes and Binary

In primary school, my teacher taught us computer science by using an empty cereal packet, knitting needles and hole-punched card. This was before I progressed to binary. This YouTube video shows how simple binary is.



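Computers usually process bits in groups whose sizes are multiples of eight, ie 8, 16, 32, 64 and 128. A bit can be either 0 or 1, while a byte is a sequence of eight bits. The more bits, the more distinct values can be represented: n bits give 2 to the power n possible values, so one byte can hold 256 different values.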

Binary can represent text. For example:

01110111  01101111  01110010  01100100
is binary for word

Converting text to binary can take a while, so using an encoder such as the one on The Problem Site is really useful.
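A small encoder of this kind can also be sketched in a few lines of Python. This is my own illustration, not how the encoder on The Problem Site works, and the function names are made up. It assumes plain ASCII text, one byte per character.

```python
def text_to_binary(text):
    """Encode ASCII text as 8-bit binary groups separated by two spaces."""
    return "  ".join(format(byte, "08b") for byte in text.encode("ascii"))

def binary_to_text(bits):
    """Decode whitespace-separated 8-bit binary groups back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")

encoded = text_to_binary("word")
print(encoded)                  # 01110111  01101111  01110010  01100100
print(binary_to_text(encoded))  # word
```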

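From the 1960s, text was commonly represented in binary using the American Standard Code for Information Interchange (ASCII), which covers only basic text: 128 characters, including the unaccented English alphabet, digits and punctuation. When you use Notepad, you will see the limitations of plain text in terms of formatting, and ASCII on its own cannot represent most other languages. When Unicode was developed, it allowed different language scripts such as 中文 (ie Chinese). ASCII is built into Unicode: the first 128 Unicode characters are the same as ASCII, so ASCII text is also valid Unicode text in the UTF-8 encoding.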
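The difference between ASCII and Unicode shows up as soon as you try to encode non-English text. A minimal sketch, assuming UTF-8 as the Unicode encoding (there are others); the helper name can_encode_ascii is my own.

```python
def can_encode_ascii(text):
    """Return True if every character in the text exists in ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(can_encode_ascii("word"))    # True
print(can_encode_ascii("中文"))     # False, ASCII has no Chinese characters

# In UTF-8 each of these two Chinese characters takes three bytes.
print(len("中文".encode("utf-8")))  # 6

# ASCII text produces exactly the same bytes when encoded as UTF-8.
print("word".encode("ascii") == "word".encode("utf-8"))  # True
```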

I produced my website using Notepad before converting it into a web document by changing the extension to .html. If you right click and view page source, you can see the HTML markup.

Some documents can only be viewed properly with specific programs; for example, you will need Microsoft Word or a compatible program for a document with a .doc extension. Problems can also arise when documents produced in newer versions of the software are not compatible with older versions.