CS330 Unix, Linux, and Tokenizing Strings

Highlights of this lab:

Unix Introduction
Parts of Unix
More on the Shell
Focus on Permissions
Review of Strings and C Strings
Dynamically Allocating C Strings
Splitting C Strings into Tokens
Dynamic Arrays of C Strings
References and More Info

Lab Code

To get the sample and exercise code, please use the following commands in the directory that you have created for this lab:

    wget www.labs.cs.uregina.ca/330/Linux/Lab3.zip
    unzip Lab3.zip

Unix Introduction

Brief History

In 1969, Ken Thompson of Bell Laboratories wrote the first version of Unix.
In 1973, Ken Thompson teamed up with Dennis Ritchie and they rewrote the Unix kernel in C.
In 1978, two branches of Unix developed:
- One maintained by Bell Labs.
  - Eventually known as SysV (System V). Individual releases were abbreviated as SVR#
  - Most common on large multi-user systems.
  - Most popular variant was SVR4 (1990) which incorporated many ideas from the second branch and its variants.
  - Later versions of SysV have been less successful, and its future is dubious.
- The second maintained by UC Berkeley's Computer Systems Research Group.
  - Known as BSD (Berkeley Software Distribution) or Berkeley Unix.
  - BSD was mostly used in personal workstations.
  - Developed under a very open license
    - Made it cheap to use
    - Code written for BSD was often borrowed by other OSes such as SVR4 and Windows.
- There are many small differences between the two and their descendants. One difference is in the way the ps command functions.
- Both branches were licensed and heavily modified by major computer companies.
In the early 1990s, Linus Torvalds developed Linux. Linux is a Unix-like OS kernel based loosely on the educational OS MINIX (Minimal Unix). It provides most of the features of SysV and BSD kernels. The GNU Project provides the shell and tools that complement Linux and make it a powerful, competitive and popular operating system.

Advantages of Unix

multitasking--multiple programs can run at one time
multiuser--more than one user can work on one machine at the same time
safe--permissions on directories, files, or disk drives prevent one program from accessing memory or storage space allocated to another

Unix Flavours

There are many different implementations of Unix. They all have subtle differences in the way that they operate. Most modern Unixes try to comply with the Single UNIX Specification (SUS), and most commercial UNIXes are officially registered as SUS compliant.

A few commercial implementations available:

Solaris (Sun Microsystems)–Originally called SunOS and based on BSD. Sun took a new direction with SunOS 5, partnered with AT&T to create SVR4 and used it as the basis for its new OS.
Mac OS X (Apple)–Based on a BSD variant called Darwin. Since Leopard, Mac OS X has been SUS UNIX 2003 compliant.

Linux comes in many different versions, called distributions or distros. Some are free; for others, the user must pay for a support contract. Many smaller distributions are based on one of the major distributions listed below. No Linux distros are officially SUS compliant because of the costs involved, but the Linux Standard Base (LSB) includes SUS standards.

Here's a list of some Linux distributions:

Red Hat–the most financially successful Linux company.
- Fedora–a free distribution sponsored by Red Hat; it serves as a testing ground for new features, so it may have bugs
- Red Hat Enterprise Linux (RHEL)–Linux for professionals. It is stable, featuring a 10 year life cycle for major releases and a 2–3 year release cycle. Sales and service of RHEL in 2012 made Red Hat the first billion dollar open source company.
SuSE (Novell)
Debian
- Ubuntu–a very friendly and popular version of Linux. Based on Debian. Many consider it the strongest contender for a Linux alternative to Windows.
Slackware–one of the oldest Linux distributions. One of its major goals is to be the most UNIX-like Linux.
Mandriva

By the way, to see the current version of Linux running on a machine, you can try one of these commands:

$ lsb_release -a
$ uname -a

Parts of Unix

Unix is organized at three levels:

kernel	“The UNIX kernel is built specifically for a machine when it is installed. It has a record of all the pieces of hardware it needs to talk to and knows what languages they speak (how to turn switches on and off to get a desired result).” http://www.extropia.com/tutorials/unix/kernel.html
shell	The Unix shell provides a user interface. “The most basic UNIX shell provides a 'command line' which allows you to type in commands which are translated by the shell into kernel speak and sent off to the kernel.” http://www.extropia.com/tutorials/unix/shells.html
tools and applications	These provide additional functionality to the operating system. To see some tools that you have access to check out: /bin or /usr/bin

More on the Unix Shell

There are several different shells, each with its own advantages and disadvantages. For instance, some allow for auto completion using the tab key; others don't.

A few common shells are the following:

sh: Bourne shell
csh: C-shell
tcsh: C-shell enhanced with file name completion and command line editing
Korn shell (ksh)
Bourne again shell (bash)

The following has a side by side comparison of some shells https://hyperpolyglot.org/unix-shells

To see what shells exist on your current Unix system, try the following command:

$ cat /etc/shells

To see what shell you are using, try the following command:

$ echo $SHELL

Focus on Permissions

Each file and directory in Unix contains a set of permissions that determine who can access it and how. There are three levels of access to set:

You can restrict access to yourself alone (user)
You can allow users in a predesignated group to have access (group)
You can permit anyone on your system to have access (world)

How do you view permissions?

The ls command with the -l option allows you to view a file's permissions (among other information).

 $ ls -l mydata
 -rw-r--r-- 1 chris weather 207 Feb 20 11:55 mydata

The breakdown of this information is as follows:

File Type	Permissions	Number of Links	Owner Name	Group Name	Size of File in Bytes	Date and Time Last Modified	File Name
-	rw-r--r--	1	chris	weather	207	Feb 20 11:55	mydata

Right now, the owner of mydata has read and write permissions, and the group and the world have read permission only. How do I know? The permissions are organized in groups of three:

the first three characters (rw-) represent the owner
the next three (r--) represent the group
and the final three (r--) represent the world (or others)

In addition,

'r' stands for read permission
'w' stands for write permission
'x' stands for executable permission
'-' (dash) stands for empty permission

What would the following permissions represent?

-rwxr--r--
drwxr-xr-x
-rwxrw-r--

How does Unix determine who has permissions to access files?

Again it comes down to the /etc/passwd file. In this file, you have a unique numeric user id and a principal group id (also numeric). When you create a file, your user id and principal group id are assigned to that file. When you later access a file, the system compares your ids with the ids stored on the file: if the user ids match, the user permissions apply; otherwise, if the file's group is one of your groups, the group permissions apply; otherwise, the world permissions apply.

You have a principal group id, but you may also belong to other groups that are not your principal group. To know what groups you belong to, try the following command:

$ groups

This command gets its information from the /etc/group file, together with the principal group id recorded for you in /etc/passwd.

How do I set permissions?

To set permissions, you use chmod. There are two main usages of chmod:

symbolic permission mode
absolute permission mode

Symbolic Permission Mode:

The general format for using the symbolic permission mode is the following:

chmod 'access class' operator 'access type' filename

For example, this would add executable access for the user:

$ chmod u+x testfile

The following summarizes the values of "access class", "operator", and "access type" in the above syntax:

Access Class
- u (user)
- g (group)
- o (other)
- a (all)
Operator
- + (adds permission)
- - (removes permission)
- = (sets exact permissions for access class specified only)
Access Type
- r (read)
- w (write)
- x (execute)
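- t (sticky bit) - historically, keep this file in swap (mostly obsolete). On a directory, it means that only a file's owner (or the directory's owner) may delete or rename files in it.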
- s (suid/sgid bit) - Users run this file with same rights as the owner or group for this file.

Given a base permission of -rw------- for a file called "myfile", what would the resulting permission be after the following chmod calls?

chmod u+x myfile
chmod a+x myfile
chmod g+r myfile

Absolute Permission Mode

Another way to change permissions is by using a numeric (octal) code. Typically, you will use three octal numbers: one for the user, one for the group and one for other (world).

The syntax for using chmod in absolute permission mode is:

chmod 'octal permissions' filename

For example:

$ chmod 744 myfile

Each of the three octal digits represents the read, write, and execute permissions for the user, group, and world respectively.

The following table summarizes the octal digits and how the permissions are affected:

Octal	Binary	Permissions
0	000	---
1	001	--x
2	010	-w-
3	011	-wx
4	100	r--
5	101	r-x
6	110	rw-
7	111	rwx

What would the permissions look like on "myfile" after the following chmod calls?

chmod 755 myfile
chmod 644 myfile
chmod 711 myfile

Review of Strings and C Strings

	C++ Strings	C Strings
general	dynamic length, can change length during the program	fixed length determined when declared, ends in '\0'
#include	`#include<string>`	`#include<cstring>`
declaring	`string theString;`	`char cString[100];`
copying	`theString=theString2;`	`strncpy(cString,cString2,100);`
getting a line	`getline(cin,theString);`	`cin.getline(cString,100);`
determining length	`theString.length();`	`strlen(cString);`
comparing	`if (theString==theString2)`	`if(!strncmp(cString,cString2,100))`

A handy thing to know is how to convert a C++ string into a C string (for copying, perhaps?). The syntax is:

strncpy(cString,theString.c_str(),100);

You may also need a review of using getline to read lines until the end of a file. The following is meant as a refresher:

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main()
{
    ifstream inFile("test.txt");
    string strOneLine;

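    // getline returns the stream, which tests false once a read fails,
    // so the loop body only runs for lines that were actually read
    while (getline(inFile, strOneLine))
    {
       cout << strOneLine << endl;
    }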

    inFile.close();

    return 0;
}

Note:

getline for reading Strings is useful if you suspect that you will have longer lines, since Strings are dynamically allocated.

Dynamically Allocating C Strings

The following is meant as a review of how to dynamically allocate and free up space.

C++ Style

char *s4;

//determine size + 1 for null
s4=new char[strlen("hello") + 1];
//Copy the strings
strncpy(s4,"hello",6);

//...

delete[] s4;

C Style

char *s4;

//determine size + 1 for null
s4=(char*)malloc(strlen("hello") + 1);
//Copy the strings
strncpy(s4,"hello",6);

//...

free(s4);

Well, maybe it's not quite a review. Some of you may not have worked with malloc and free. The reason for introducing it now is that you can reduce the above code to the following:

char *s4;
   
s4=strdup("hello");   //make copy of "hello"

//...

free(s4);

strdup is not part of the C++ standard, and it was only added to the C standard in C23; it has long been part of the POSIX standards. If you are programming on a POSIX-compatible OS such as Linux (the lab), Solaris, or Mac OS X, then you can use strdup:

char *strdup(const char *s)

returns a pointer to a new string that is a duplicate of the string pointed to by s. The returned pointer should be released with free() because the space for the new string is obtained using malloc. If the new string cannot be created, a null pointer is returned.

Notes:

When you allocate space for a character array, always remember to count the space taken up by the null terminator '\0'
You cannot do assignment as in:
```
char s5[6];
   
s5="bye"; // **WRONG**
```
You must instead use the strncpy function. (Or the strdup function if it is available.)
Don't forget to release any memory you request while copying strings after you are done using it.
In C code, when you use C String functions, you will use
#include <string.h>
In C++, when you use C String functions, you will see instead:
#include <cstring>
In C code, when you use malloc and free functions, you will use
#include <stdlib.h>
In C++, when you use malloc and free you will see instead:
#include <cstdlib>

Splitting C Strings into Tokens

Sometimes you may want to split a line into tokens or words. To do that, there is a C String function called strtok. The prototype is:
char * strtok (char * str, const char * delimiters)
where str is the line (or C string) that you want to split into tokens or words, and delimiters is a C string in which any one of the characters delimits or marks the boundaries between words.

The following is an example of using strtok:

#include <iostream>
#include <cstring>
using namespace std;

int main(int argc, char *argv[])
{
   char cstr1[]="This is a sample string. Is it working?";
   char delim[]=" ,.-;!?";
   char *token;

   cout << "cstr1 before being tokenized: " << cstr1 << endl << endl;

   //In the first call to strtok, the first argument is the line to be tokenized
   token=strtok(cstr1, delim);
   cout << token << endl;

   //In subsequent calls to strtok, the first argument is NULL
   while((token=strtok(NULL, delim))!=NULL)
   {
         cout << token << endl;
   }
}

The output:

cstr1 before being tokenized: This is a sample string. Is it working?

This
is
a
sample
string
Is
it
working

There are a couple of "catches" with strtok:

In the first call to strtok, the first argument is the line or C string to be tokenized; in subsequent calls to strtok, the first argument is NULL. Notice the two calls from the lines above:
- token=strtok(cstr1, delim)
- token=strtok(NULL, delim)
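The original C string is modified when it is tokenized: the delimiter that ends each token is replaced by a null terminator ('\0'). After tokenizing, the C string in the sample code holds This\0is\0a\0sample\0string\0 Is\0it\0working\0 (the space after the period survives because strtok only overwrites the one delimiter that ends each token).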

Dynamic Arrays of C Strings

Sometimes you want to have a dynamically created array of C Strings. The following code demonstrates this:

#include <cstdlib>    // free
#include <cstring>    // strdup
#include <iomanip>    // setw
#include <iostream>
using namespace std;

int main ()
{
  char **words;
  char tempWord[100];
  char endWord[]="330!";

  words = new char *[3]; //allocate pointers to three words
  //OR words = (char **) malloc (sizeof(char *) * 3);

  //--------------
  //get two words from the user input--use strdup to dynamically allocate space 
  //setw(100) stops cin from writing past the end of tempWord
  cout << "Please input a word (less than 100 characters): ";
  cin >> setw(100) >> tempWord;
  words[0]=strdup(tempWord);

  cout << "Please input a second word (less than 100 characters): ";
  cin >> setw(100) >> tempWord;
  words[1]=strdup(tempWord);

  //--------------
  //the third one hard code copy of "330!" (endWord)
  words[2]=strdup(endWord);
  
  //--------------
  //print and clean up individual words as you go
  for (int i=0; i<3; i++)
  {
     cout << words[i] << endl;
     free(words[i]);   //remember that space was set aside by strdup 

  }
  
  //--------------
  //Clean up the array of words
  delete [] words;     // cleans up words = new char *[3];
  //OR if you used malloc: free (words);

}

References and More Info

For the history of Unix and Linux:
For a definition of the Unix kernel: http://www.extropia.com/tutorials/unix/kernel.html
For a definition of the Unix shell: http://www.extropia.com/tutorials/unix/shell.html
For information on how you get your shell and permissions: http://www.tldp.org/HOWTO/Unix-and-Internet-Fundamentals-HOWTO/login.html
For more information about getline for C Strings: http://www.cplusplus.com/reference/iostream/istream/getline.html
For information on how to use getline for strings: http://www.cplusplus.com/reference/string/getline/?kw=getline

Extra Info:

Check out companies using Linux: http://mtechit.com/linux-biz/
The Linux Documentation Project (LDP): http://www.tldp.org/