CodeBook
Course Project for Gettting & Celaning Data based on Human Activity Recognition Using Smartphones Dataset
This CodeBook that describes the variables, the data, and any transformations or work that was performed to clean up
the source data to create a tidy dataset as per requirements of course project.
Information about Source Data Experment
The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years.
Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing
a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial
linear acceleration and 3-axial angular velocity at a constant rate of 50Hz.
The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned
into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.
Original Data Source
Human Activity Recognition Using Smartphones Dataset
Data for analysis is downloaded from the below URL
Structure of Data present in source data folders
README.txt: Details of all the files in downloaded folderfeatures_info.txt: Shows information about the variables used on the feature vector.features.txt: List of all features.i.e list of all measurement variablesactivity_labels.txt: Lists the activity Id with their corresponding activity name.train/X_train.txt: Training set.train/y_train.txt: Training activity Id Labelstrain/subject_train.txt: Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30.test/X_test.txt: Test set.test/y_test.txt: Test activity Id Labelstest/subject_train.txt: Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30.
The following files are available for the train and test data. Their descriptions are equivalent.
- train/Inertial Signals/total_acc_x_train.txt: The acceleration signal from the smartphone accelerometer X axis in standard gravity units g. Every row shows a 128 element vector. The same description applies for the
total_acc_x_train.txt and total_acc_z_train.txt files for the Y and Z axis.
- train/Inertial Signals/body_acc_x_train.txt: The body acceleration signal obtained by subtracting the gravity from
the total acceleration.
- train/Inertial Signals/body_gyro_x_train.txt: The angular velocity vector measured by the gyroscope for each window sample. The units are radians/second.
Note: All the files in train/Inertial Signals and test/Inertial Signals will not be used for in this analysis
Details about Files to be used in analysis from Source Data
Common Files
- features.txt: 561 rows of 2 varibles (feature Identifier and feature Name)
- activity_labels.txt: 6 rows of 2 variables (activity identifier and activity name)
Test Dataset
- xTest.txt: 2947 rows of 561 measurement variables. These are measurement variables listed in features.txt
- yTest.txt: 2947 rows of 1 variables. This is the activity Identifier
- subjectTest.txt: 2497 rows of 1 variable (subject Identifier)
Training Dataset
- xTrain.txt: 7352 rows of 561 measurement variables. These are measurement variables listed in features.txt
- yTrain.txt: 7352 rows of 1 variables. This is the activity Identifier
- subjectTrain.txt: 7352 rows of 1 variable (subject Identifier)
Map of aggregated Data
| Variable Names | subjectId | activityId | (variable names from features.txt) |
|---|---|---|---|
| Data | subjectTest.txt |
yTest.txt |
xTest.txt |
| Data | subjectTrain.txt |
yTrain.txt |
xTrain.txt |
Requirements & Details of Transformations through run_analysis.R script
Requirements
run_analysis.R script has the following requirements to perform transformation on UCI HAR Dataset.
- Merges the training and the test sets to create one data set.
- Extracts only the measurements on the mean and standard deviation for each measurement.
- Uses descriptive activity names to name the activities in the data set
- Appropriately labels the data set with descriptive activity names.
- Creates a second, independent tidy data set with the average of each variable for each activity and each subject.
Detailed Functions of run_analysis.R Script
- Downloads the dataset from the URL mentioned above and unzips it to create UCI HAR Dataset folder
- Imports "test" and "train" datsets and creates data frames from then and then Merges the training and the test sets
to create one data frame. - Extracts a subset of data with only the measurements on the mean "mean()" and standard deviation "std()" for each measurement. Also excludes meanFreq()-X measurements or angle measurements where the term mean exists resulting in
66 measurement variables. - Updates the variable names in dataframe variable names for data fame to improve readibility
- Appropriately labels the data set with descriptive activity names in place of activity Ids
- Reshapes dataset to create a data frame with average of each measurement variable for each activity and each subject
- Writes new tidy data frame to a text file to create the required tidy data set file of 180 observations and 68 columns (2 columns for activityName and subjectID and 66 columns for measurement variables)
Transformations Performed on the original dataset.
Merging the training and the test sets to create one data set.
Activities:
- Download the dataset from the URL mentioned above and unzip it to create UCI HAR Dataset folder.
- Script Imports
testandtraindatsets and creates data frames from then and then merges the training and the test sets to create one data frame.
All files to be used as listed above are imported to created data frames and column variables names are updated as follows
| data.frame | Variable Names |
|---|---|
featureVariables |
"varId", "varName" |
activityLabels |
"activityId", "activityName" |
xTest |
featureVariables$varName |
yTest |
"activityId" |
subjectTest |
"subjectId" |
xTrain |
featureVariables$varName |
yTrain |
"activityId" |
subjectTrain |
"subjectId" |
Merge Activities
subjectTest, yTest, xTest were column bind using cbind function to create testData data frame which added
"subjectId", "activityId" to dataset making it 563 column data.frame with 2947 rows
subjectTrain, yTrain, xTrain were column bind using cbind function to create trainData data.frame which added "subjectId", "activityId" to dataset making it 563 column data.frame with 7352 rows
Aggregared and Merged Dataset
testData and trainData data.frame were rowbound using rbind function to create final aggregated dataset/data.frame AggregateData with 10299 rows and 563 columns
Below code shows few details of AggregateData
> head(names(AggregateData),10)
[1] "subjectId" "activityId" "tBodyAcc-mean()-X" "tBodyAcc-mean()-Y"
[5] "tBodyAcc-mean()-Z" "tBodyAcc-std()-X" "tBodyAcc-std()-Y" "tBodyAcc-std()-Z"
[9] "tBodyAcc-mad()-X" "tBodyAcc-mad()-Y"
>
> head(AggregateData[1:5])
subjectId activityId tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z
1 2 5 0.2571778 -0.02328523 -0.01465376
2 2 5 0.2860267 -0.01316336 -0.11908252
3 2 5 0.2754848 -0.02605042 -0.11815167
4 2 5 0.2702982 -0.03261387 -0.11752018
5 2 5 0.2748330 -0.02784779 -0.12952716
6 2 5 0.2792199 -0.01862040 -0.11390197
>
Extraction of only the measurements on the mean and standard deviation for each measurement.
Activities:
- Extract a subset of data with only the measurements on the mean
mean()and standard deviationstd()for
each measurement
Extraction of selected measurement values
grepfunctions are used to search for occurance of meanmean()and standard deviationstd()inAggregateData
variable Names using escape characters.- Using escape characters to search exactly for
mean()andstd()occurance helps to excludemeanFreq()-Xmeasurements
and/or angle measurements where the termmeanexists - The resulting selection would have only
66measurement variables.
Below code shows search using grep functions in column names of AggregateData data frame.
> head(featureVariables[grepl("mean\\(\\)|std\\(\\)", names(AggregateData)), ], 10)
varId varName
3 3 tBodyAcc-mean()-Z
4 4 tBodyAcc-std()-X
5 5 tBodyAcc-std()-Y
6 6 tBodyAcc-std()-Z
7 7 tBodyAcc-mad()-X
8 8 tBodyAcc-mad()-Y
43 43 tGravityAcc-mean()-Z
44 44 tGravityAcc-std()-X
45 45 tGravityAcc-std()-Y
46 46 tGravityAcc-std()-Z
>
>
>
> tail(featureVariables[grepl("mean\\(\\)|std\\(\\)", names(AggregateData)), ])
varId varName
518 518 fBodyBodyAccJerkMag-mad()
519 519 fBodyBodyAccJerkMag-max()
531 531 fBodyBodyGyroMag-mad()
532 532 fBodyBodyGyroMag-max()
544 544 fBodyBodyGyroJerkMag-mad()
545 545 fBodyBodyGyroJerkMag-max()
>
- In order to extract a subset of only measurements on the mean
mean()and standard deviationstd()grepis used
to create index of matched column numbers. - This index of column numbers is used to create subset of data based on
mean()andstd()column indices and also includng first 2 columns which havesubjectIdandactivityIdvalues.
Below code show the Index of occurance for mean() and std()
> head(grep("mean\\(\\)|std\\(\\)", names(AggregateData)), 10)
[1] 3 4 5 6 7 8 43 44 45 46
>
This stage crates a data.frame subsetAggregateData of 10299 observations and 68 variables
Clean up variable names to in the data set
Activities:
- Update the variable names in dataframe for data fame to improve readibility
Variable Transformation activities
camelcasehas been used for R objects created and variable names for data fame to improve readibility.gsubfunction is used to remove instances of"-"and"()"from variables names in the extracted subset
subsetAggregateDatadata.frame- Instances of
meanare replaced byMeanandstdbyStdto obtain proper camelcase in variable names
e.g."tBodyAcc-mean()-X"is changed to"tBodyAccMeanX"
Usage of gsub function
gsub("-", "", names(subsetAggregateData))
Below code shows the changes in variable names after cleaning up variable names.
> head(names(AggregateData))
[1] "subjectId" "activityId" "tBodyAcc-mean()-X" "tBodyAcc-mean()-Y"
[5] "tBodyAcc-mean()-Z" "tBodyAcc-std()-X"
>
> head(names(subsetAggregateData))
[1] "subjectId" "activityId" "tBodyAccMeanX" "tBodyAccMeanY" "tBodyAccMeanZ" "tBodyAccStdX"
>
Label the data set with descriptive activity names.
Activities:
- Appropriately label the data set with descriptive activity names in place of
activity Ids
ActivityName Update
activityLabelsandsubsetAggregateDatadata frames are merged usingmergefunction to add a new column
and create a new data.framesubFinalDatawith corresponding"activityName"for each"activityId"in each row of the
dataset
The below code snippet shows updated activityName in the dataset
> head(subFinalData[1:5])
activityId activityName subjectId tBodyAccMeanX tBodyAccMeanY
1 1 WALKING 26 0.2314146 -0.017722438
2 1 WALKING 29 0.3312213 -0.018502366
3 1 WALKING 29 0.3755700 -0.024728610
4 1 WALKING 29 0.2332297 -0.034451457
5 1 WALKING 29 0.2362494 -0.014396940
6 1 WALKING 29 0.2645428 0.002484389
>
Final Dataset finalData
"activityId"column which is no longer needed because we mappedacitityNametoactivityIdin the dataset"activityId"column is dropped to create final data.frame calledfinalData- This data frame has
10299observations and68columns. 2columns for"activityName"and"subjectId"and remaing 66 for measurement variables with measurements
on themean()andstd()
Tidy Data Set with the average of each variable for each activity and each subject
Activities:
- Reshape dataset to create a data frame with average of each measurement variable for each activity and each subject
- Writes new tidy data frame to a text file to create the required tidy data
Transformation Details - melt and dcast
reshape2package is leveraged for reshaping the dataset.library(reshape2)meltfunction ofreshare2package is leveraged to reshape data based on id variables"activityName"and
"subjectId"against all measurement values variables to createfinalDataMeltdata frame.melttakes wide-format data and melts it into long-format data.finalDataMeltdata frame has 679734 observations of 4 variables
Using melt
The below code shows data transformation done by using melt function to create finalDataMelt data frame
> head(finalDataMelt)
activityName subjectId variable value
1 WALKING 26 tBodyAccMeanX 0.2314146
2 WALKING 29 tBodyAccMeanX 0.3312213
3 WALKING 29 tBodyAccMeanX 0.3755700
4 WALKING 29 tBodyAccMeanX 0.2332297
5 WALKING 29 tBodyAccMeanX 0.2362494
6 WALKING 29 tBodyAccMeanX 0.2645428
>
Using dcast
- lastly,
dcastfunction ofreshap2package is leveraged for creating new tidy data.frame calledavgSujectActivities - This dataset provides average of each measurement variable for each activity and each subject
dcasttakes long-format data and casts it into wide-format data.
Tidy data frame
- The result od
dcastoperation is the tidy data frameavgSujectActivities - This data frame has
180observations/rows and68columns/variables 68columns(2columns foractivityNameandsubjectIDand66columns for measurement variables)- Each measurement variable columns
[3 to 68]is average value for each combination ofsubjectIdandactivityName
The below code shows few details of the Tidy data frame
> head(avgSujectActivities[1:7])
activityName subjectId tBodyAccMeanX tBodyAccMeanY tBodyAccMeanZ tBodyAccStdX tBodyAccStdY
1 LAYING 1 0.2215982 -0.04051395 -0.1132036 -0.9280565 -0.8368274
2 LAYING 2 0.2813734 -0.01815874 -0.1072456 -0.9740595 -0.9802774
3 LAYING 3 0.2755169 -0.01895568 -0.1013005 -0.9827766 -0.9620575
4 LAYING 4 0.2635592 -0.01500318 -0.1106882 -0.9541937 -0.9417140
5 LAYING 5 0.2783343 -0.01830421 -0.1079376 -0.9659345 -0.9692956
6 LAYING 6 0.2486565 -0.01025292 -0.1331196 -0.9340494 -0.9246448
>
Tidy Data File
- The
avgSujectActivitiesdata frame is written to a file usingwrite.tablefunction with"\t"separator to
createavgSujectActivities.txtfile - By default column names are kept in file. Row Names have to be explicity excluded using
row.names=FALSEargument
inwrite.tablefunction
Description of variables in the Tidy Data
| Variable Name | Details |
|---|---|
activityName |
Factor with 6 levels WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING |
subjectId |
Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30 |
tBodyAccMeanX |
Average of Mean Value time doman Body Accelration in x direction |
tBodyAccMeanY |
Average of Mean Value time doman Body Accelration in Y direction |
tBodyAccMeanZ |
Average of Mean Value time doman Body Accelration in Z direction |
tBodyAccStdX |
Average of Standard deviation time doman Body Accelration in x direction |
tBodyAccStdY |
Average of Standard deviation time doman Body Accelration in Y direction |
tBodyAccStdZ |
Average of Standard deviation time doman Body Accelration in Z direction |
tGravityAccMeanX |
Average of Mean Value time doman Gravity Accelrationin x direction |
tGravityAccMeanY |
Average of Mean Value time doman Gravity Accelrationin Y direction |
tGravityAccMeanZ |
Average of Mean Value time doman Gravity Accelrationin Z direction |
tGravityAccStdX |
Average of Standard deviation time doman Gravity Accelrationin x direction |
tGravityAccStdY |
Average of Standard deviation time doman Gravity Accelrationin Y direction |
tGravityAccStdZ |
Average of Standard deviation time doman Gravity Accelrationin Z direction |
tBodyAccJerkMeanX |
Average of Mean Value time doman Body Accelration Jerk in x direction |
tBodyAccJerkMeanY |
Average of Mean Value time doman Body Accelration Jerk in Y direction |
tBodyAccJerkMeanZ |
Average of Mean Value time doman Body Accelration Jerk in Z direction |
tBodyAccJerkStdX |
Average of Standard deviation time doman Body Accelration Jerk in x direction |
tBodyAccJerkStdY |
Average of Standard deviation time doman Body Accelration Jerk in Y direction |
tBodyAccJerkStdZ |
Average of Standard deviation time doman Body Accelration Jerk in Z direction |
tBodyGyroMeanX |
Average of Mean Value time doman Body Gyro in x direction |
tBodyGyroMeanY |
Average of Mean Value time doman Body Gyro in Y direction |
tBodyGyroMeanZ |
Average of Mean Value time doman Body Gyro in Z direction |
tBodyGyroStdX |
Average of Standard deviation time doman Body Gyro in x direction |
tBodyGyroStdY |
Average of Standard deviation time doman Body Gyro in Y direction |
tBodyGyroStdZ |
Average of Standard deviation time doman Body Gyro in Z direction |
tBodyGyroJerkMeanX |
Average of Mean Value time doman Body Gyro Jerk signal in x direction |
tBodyGyroJerkMeanY |
Average of Mean Value time doman Body Gyro Jerk signal in Y direction |
tBodyGyroJerkMeanZ |
Average of Mean Value time doman Body Gyro Jerk signal in Z direction |
tBodyGyroJerkStdX |
Average of Standard deviation time doman Body Gyro Jerk signal in x direction |
tBodyGyroJerkStdY |
Average of Standard deviation time doman Body Gyro Jerk signal in Y direction |
tBodyGyroJerkStdZ |
Average of Standard deviation time doman Body Gyro Jerk signal in Z direction |
tBodyAccMagMean |
Average of Mean Value time doman Body Accelration magnitude |
tBodyAccMagStd |
Average of Standard deviation time doman Body Accelration magnitude |
tGravityAccMagMean |
Average of Mean Value time doman Gravity Accelration magnitude |
tGravityAccMagStd |
Average of Standard deviation time doman Gravity Accelration magnitude |
tBodyAccJerkMagMean |
Average of Mean Value time doman Body Accelration jerk magnitude |
tBodyAccJerkMagStd |
Average of Standard deviation time doman Body Accelration jerk magnitude |
tBodyGyroMagMean |
Average of Mean Value time doman Body Gyro magnitude |
tBodyGyroMagStd |
Average of Standard deviation time doman Body Gyro magnitude |
tBodyGyroJerkMagMean |
Average of Mean Value time doman Body Gyro Jerk magnitude |
tBodyGyroJerkMagStd |
Average of Standard deviation time doman Body Gyro Jerk magnitude |
fBodyAccMeanX |
Average of Mean Value frequency domainBody Accelration in x direction |
fBodyAccMeanY |
Average of Mean Value frequency domainBody Accelration in Y direction |
fBodyAccMeanZ |
Average of Mean Value frequency domainBody Accelration in Z direction |
fBodyAccStdX |
Average of Standard deviation frequency domainBody Accelration in x direction |
fBodyAccStdY |
Average of Standard deviation frequency domainBody Accelration in Y direction |
fBodyAccStdZ |
Average of Standard deviation frequency domainBody Accelration in Z direction |
fBodyAccJerkMeanX |
Average of Mean Value frequency domainBody Accelration Jerk in x direction |
fBodyAccJerkMeanY |
Average of Mean Value frequency domainBody Accelration Jerk in Y direction |
fBodyAccJerkMeanZ |
Average of Mean Value frequency domainBody Accelration Jerk in Z direction |
fBodyAccJerkStdX |
Average of Standard deviation frequency domainBody Accelration Jerk in x direction |
fBodyAccJerkStdY |
Average of Standard deviation frequency domainBody Accelration Jerk in Y direction |
fBodyAccJerkStdZ |
Average of Standard deviation frequency domainBody Accelration Jerk in Z direction |
fBodyGyroMeanX |
Average of Mean Value frequency domainBody Gyro in x direction |
fBodyGyroMeanY |
Average of Mean Value frequency domainBody Gyro in Y direction |
fBodyGyroMeanZ |
Average of Mean Value frequency domainBody Gyro in Z direction |
fBodyGyroStdX |
Average of Standard deviation frequency domainBody Gyro in x direction |
fBodyGyroStdY |
Average of Standard deviation frequency domainBody Gyro in Y direction |
fBodyGyroStdZ |
Average of Standard deviation frequency domainBody Gyro in Z direction |
fBodyAccMagMean |
Average of Mean Value frequency domainBody Accelration magnitude |
fBodyAccMagStd |
Average of Standard deviation frequency domainBody Accelration magnitude |
fBodyBodyAccJerkMagMean |
Average of Mean Value frequency domainBody Accelration jerk magnitude |
fBodyBodyAccJerkMagStd |
Average of Standard deviation frequency domainBody Accelration jerk magnitude |
fBodyBodyGyroMagMean |
Average of Mean Value frequency domainBody Body Gyro magnitude |
fBodyBodyGyroMagStd |
Average of Standard deviation frequency domainBody Body Gyro magnitude |
fBodyBodyGyroJerkMagMean |
Average of Mean Value frequency domainBody Body Gyro jerk magnitude |
fBodyBodyGyroJerkMagStd |
Average of Standard deviation frequency domainBody Body Gyro jerk magnitude |