Creating an artificial test set using emulation

Latest revision as of 04:36, 16 November 2012

Main Page > CURATEcamp iPRES 2012 > CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012

Files for testing format identification tools are most useful when their source and content are known. Such files that are also free to use can be hard to find. A simple way to solve this problem is to create the files with the original software, running it under emulation or virtualisation.

Plan for the Hackathon day

  1. WHILE STATUS = AWAKE
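    1. Pick a software application and a save parameter to use to create files, and add the details to the Google spreadsheet (https://docs.google.com/spreadsheet/ccc?key=0AjoFh8b0kxFndGdfWVRUZEllY1c5a3NiUi14cGRMb3c), e.g. WordStar 7 for DOS, "default save parameter (WordStar 7.0 format)", plus details of the user doing the work.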
    2. Set up emulation environment and install software.
    3. Get content to include: copy text to a text file, structured data to a CSV/tab-delimited file and images to BMP/TIFF, then add them to a virtual disk file that can be attached to the emulated environment.
    4. Boot environment and create file using selected parameter. Ensure all content & significant properties are included.
    5. Take screenshots to ensure significant properties are thoroughly documented.
    6. Transfer files into the host OS. You can mount the disk image in Windows using this free tool: http://www.osforensics.com/tools/mount-disk-images.html
    7. Upload files and documentation (including screenshots and a list of included properties/content) to the GitHub repository (https://github.com/openplanets/format-corpus).
    8. GOTO 1.1
  2. END
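
The documentation uploaded in step 1.7 could be accompanied by a small machine-readable record for each file. The following is only a sketch of one way to do that: the field names, the `.manifest.json` suffix and the use of a SHA-256 checksum are assumptions made for illustration and are not part of the hackathon plan or the repository's conventions.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(file_path, software, environment, save_parameter,
                   created_by, properties, screenshots=()):
    """Describe one test file. Every field name here is hypothetical."""
    path = Path(file_path)
    return {
        "file": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_of(path),
        "software": software,              # e.g. the application and version
        "environment": environment,        # e.g. the emulated operating system
        "save_parameter": save_parameter,  # the format option chosen on save
        "created_by": created_by,
        "properties_included": sorted(properties),
        "screenshots": list(screenshots),
    }


def write_manifest(manifest, directory):
    """Write the record as <file name>.manifest.json in the given directory."""
    target = Path(directory) / (manifest["file"] + ".manifest.json")
    target.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return target
```

Recording a checksum alongside the description means anyone reusing the test set can confirm the file was not altered on its way out of the emulated environment.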

Software used for creating files

euanc: I have the following available immediately (others on request):

  1. Microsoft Excel 97 SR-1 on Windows 98 SE (4.10.2222 A)
  2. Microsoft Word 97 SR-1 on Windows 98 SE (4.10.2222 A)
  3. Paradox 7 Version 7.0 on Windows 98 SE (4.10.2222 A)
  4. Microsoft Access for Windows 95 Version 7.00 on Windows 95
  5. Microsoft Excel for Windows 95 Version 7.0 on Windows 95
  6. Microsoft Word for Windows 95 Version 7.0 on Windows 95
  7. Microsoft Works Word Processor Version 4.0 for Windows 95 on Windows 95
  8. Corel WordPerfect Version 6.1 for Windows on Windows 3.1
  9. Freelance Graphics for Windows Release 2.1 on Windows 3.1
  10. Microsoft Access Version 2.00 on Windows 3.1
  11. Microsoft Excel Version 5.0c on Windows 3.1
  12. Microsoft Word Version 6.0c on Windows 3.1
  13. Quattro Pro Version 6.02 on Windows 3.1
  14. WordPerfect for Windows Version 5.2 on Windows 3.1
  15. Ashton Tate dBASE IV on MS-DOS 6.22
  16. Framework II on MS-DOS 6.22
  17. Microsoft Word 5.5 on MS-DOS 6.22
  18. WordPerfect 5.1+ on MS-DOS 6.22
  19. WordStar for DOS North American Version 7.0 Rev. A on MS-DOS 6.22
  20. Microsoft Word 2000 (9.0.3821 SR-1) on Microsoft Windows 2000 (5.00.2195 Service Pack 4)

Parameters to use (formats to create)

Content to include

Ideally, each file should include every possible type of content, with multiple instances of each in different configurations, so that the files support comprehensive testing. Content should also not be repeated within the test files, so that it is easy to identify where in a file a given piece of content came from.
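
One way to keep content unrepeated and traceable is to give each piece of content its own marker string and then search the saved file for those markers. The sketch below is illustrative only: the `TESTSET` prefix and the function names are made up, and it assumes the format stores text as plain, uncompressed bytes in the chosen encoding. Formats that compress, split or re-encode text would need a different search.

```python
def make_markers(labels, prefix="TESTSET"):
    """Return {label: marker}, one fixed-width ASCII marker per label."""
    labels = list(labels)
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")
    # Fixed-width numbering means no marker is a prefix of another.
    return {label: f"{prefix}{index:04d}"
            for index, label in enumerate(labels, start=1)}


def locate_markers(data: bytes, markers, encoding="ascii"):
    """Return {label: [byte offsets]} for every marker occurrence in data."""
    found = {}
    for label, marker in markers.items():
        needle = marker.encode(encoding)
        offsets = []
        start = data.find(needle)
        while start != -1:
            offsets.append(start)
            start = data.find(needle, start + 1)
        found[label] = offsets
    return found
```

Typing or pasting the marker for "Bold text" into the bold passage, the marker for "Italic text" into the italic passage, and so on, then running `locate_markers` over the bytes of the saved file, shows where each piece of content ended up. An empty list for a label suggests the text is not stored literally.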

Content sources

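  1. The Statistics New Zealand Census 2006 Synthetic Unit Record File (http://www.stats.govt.nz/tools_and_services/services/schools_corner/SURF%20for%20schools/census.aspx) can be used for spreadsheet data.
  2. The Rendering Matters Report (http://archives.govt.nz/resources/information-management-research/rendering-matters-report-results-research-digital-object-r) can be used for document data.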

Significant Properties

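A list of potential types of content (Significant Properties) is included in Appendix 4 of the Rendering Matters Report (http://archives.govt.nz/resources/information-management-research/rendering-matters-report-results-research-digital-object-r). These were described in question form in the report and were as follows (could do with sorting into document, data, slide set etc.):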

Word Processing file (document) significant properties

  1. Macros or scripts
  2. Links to external files (not hyperlinks)
  3. Hyperlinks
  4. Internal links
  5. Editing restrictions
  6. Edit history (track changes)
  7. Embedded Metadata
    1. author's name
    2. date created
    3. date last edited
    4. time spent authoring
  8. Text Formatting
    1. (multiple obscure) fonts
    2. Bold text
    3. Italic text
    4. Underline text
    5. Superscript text
    6. Subscript text
    7. Coloured text
    8. Text with coloured background (highlighting)
    9. ...
  9. Specific page dimensions
    1. A4 Page dimensions
    2. US Letter page dimensions
    3. A3 page dimensions
    4. ...
  10. Document templates
    1. Envelope template
    2. Letter template
    3. ...
  11. Images
    1. Image size on screen
    2. Image colours
    3. Image resolution
    4. Image Dimensions
    5. Image positioning
      1. In line with text
      2. Behind Text
      3. In front of text
      4. Square
      5. Tight
      6. Through
      7. Top and bottom
    6. Image file format
      1. JPG (v x)
      2. TIFF (v x)
      3. BMP
      4. ...
  12. Text crossing multiple pages (pagination)
  13. Content positioned in varying places on a page
  14. Text across multiple lines
  15. Specific line spacing
  16. New line characters
  17. Special Symbols
  18. Equations
  19. Page breaks
  20. Section breaks
  21. Indented text
  22. Bullet Points
    1. Bullet points with special symbols (e.g. diamonds, dashes, dots)
  23. Numbered lists
    1. Multi-level numbered lists
  24. Tables
    1. Formatted tables
      1. Borders
        1. Coloured borders
        2. Borders of varying weights (thickness)
      2. Coloured cells
  25. Custom views
  26. Custom shapes
  27. Hidden content
  28. Watermarks
  29. Custom character sets
  30. Custom languages/language interfaces (e.g. Romanian spell check)
  31. A specific number of words
  32. A specific number of words as reported by the software
  33. Footnotes
  34. Endnotes
  35. Page numbers
  36. References
  37. An embedded current-date object
  38. Borders
  39. Citations
  40. Mail-merge settings
  41. Comments


Spreadsheet file (structured data) significant properties

  1. Macros or scripts
  2. Links to external files (not hyperlinks)
  3. Hyperlinks
  4. Internal links
  5. Editing restrictions
  6. Edit history (track changes)
  7. Embedded Metadata
    1. author's name
    2. date created
    3. date last edited
    4. time spent authoring
  8. Text Formatting
    1. (multiple obscure) fonts
    2. Bold text
    3. Italic text
    4. Underline text
    5. Superscript text
    6. Subscript text
    7. Coloured text
    8. Text with coloured background (highlighting)
    9. ...
  9. Specific page dimensions
    1. A4 Page dimensions
    2. US Letter page dimensions
    3. A3 page dimensions
    4. ...
  10. Document templates
    1. Envelope template
    2. Letter template
    3. ...
  11. Images
    1. Image size on screen
    2. Image colours
    3. Image resolution
    4. Image Dimensions
    5. Image positioning
      1. In line with text
      2. Behind Text
      3. In front of text
      4. Square
      5. Tight
      6. Through
      7. Top and bottom
    6. Image file format
      1. JPG (v x)
      2. TIFF (v x)
      3. BMP
      4. ...
  12. Content positioned in varying places on a sheet
  13. Text across multiple lines
  14. New line characters
  15. Special Symbols
  16. Page breaks
  17. Section breaks
  18. Indented text
  19. Bullet Points
    1. Bullet points with special symbols (e.g. diamonds, dashes, dots)
  20. Numbered lists
    1. Multi-level numbered lists
  21. Custom views
  22. Custom shapes
  23. Hidden content
  24. Watermarks
  25. Custom character sets
  26. Custom languages/language interfaces (e.g. Romanian spell check)
  27. A specific number of words
  28. A specific number of words as reported by the software
  29. Page numbers
  30. An embedded current-date object
  31. Mail-merge settings
  32. Comments
  33. Coloured cells
  34. Cell borders
    1. Coloured Borders
    2. Borders of varying weights
  35. Conditional formatting
  36. Columns with a specific order
  37. Rows with a specific order
  38. Functions
    1. sum()
    2. average()
    3. ........ there are a lot of possibilities here and some will be problematic
  39. Pivot tables
  40. Named cells
  41. Named ranges
  42. Cells of different types
    1. Text-type cells
    2. Short-date-type cells
    3. Number-2dp-type cells
    4. ....
  43. Applied filters
  44. Links to other data sources
  45. Multiple worksheets
  46. Chart/Graphs
    1. Charts with varying layouts
    2. Charts with titles
    3. Charts with labeled axes
    4. Charts with specific axes ratios
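
For the function and cell-type properties above, it may help to work out outside the emulated environment what a sum() or average() cell should display, so that screenshots can be checked against an independent figure. The sketch below assumes a two-decimal-place number format and round-half-up display rounding. Both are assumptions, since the original software may round differently, and spotting such differences is part of the point.

```python
from decimal import Decimal, ROUND_HALF_UP


def expected_cells(values, places=2):
    """Return the strings a sum() and an average() cell would be expected
    to show at the given number of decimal places (round half up)."""
    numbers = [Decimal(str(value)) for value in values]
    total = sum(numbers, Decimal(0))
    average = total / len(numbers)
    quantum = Decimal(1).scaleb(-places)  # places=2 gives Decimal("0.01")
    return {
        "sum()": str(total.quantize(quantum, rounding=ROUND_HALF_UP)),
        "average()": str(average.quantize(quantum, rounding=ROUND_HALF_UP)),
    }
```

Using `Decimal` rather than binary floating point keeps the expected figures exact for short decimal inputs, so any mismatch seen in a screenshot is more likely to come from the software under test than from the check itself.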

Presentation file significant properties

(The questions below are to be removed.)
  1. Are there macros or scripts in the digital object?
  2. Are there any links in the file to other files?
  3. Can the Macros or Scripts be executed?
  4. Are the links to external files still working?
  5. Are there any editing restrictions on the object?
  6. Have the restrictions been maintained?
  7. What type of rendering is being observed in this test?
  8. Does the object contain an edit history?
  9. Has the edit history been maintained?
  10. Is there metadata embedded in the file such as the author's name, date saved, amount of time spent authoring, etc?
  11. Has the embedded metadata been maintained?
  12. Are any/all fonts being fully and accurately rendered?
  13. Has the text formatting been maintained? e.g. bold, italic, underline, superscript, subscript or strike-through?
  14. Is there text formatting included in the object, e.g. bold, italics, underline, strike-through, subscript or superscript?
  15. Does the object have text of any colour other than black?
  16. Has the text colour been maintained?
  17. Does the object include highlighted text?
  18. Has the highlighted text been maintained?
  19. Have the page dimensions been maintained?
  20. Has the pagination been maintained?
  21. Has the position on screen of content been maintained?
  22. Has the position of content on the page been maintained?
  23. Has line spacing been maintained?
  24. Have the new-lines been correctly placed?
  25. Have page and section breaks been maintained?
  26. Has the orientation of objects/text been maintained?
  27. Has the justification of text been maintained?
  28. Has any extra/additional information/data been added to the object that is observable by the user?
  29. Does the object contain images?
  30. Has the image orientation and position been maintained?
  31. Has the image size been maintained?
  32. Have the colours of the image been maintained?
  33. Has the resolution of the image been maintained?
  34. Does the object include custom views?
  35. Have custom views been maintained?
  36. Does the object include custom shapes?
  37. Have custom shapes been maintained?
  38. Does the object include hidden content?
  39. Has the hidden content been maintained?
  40. Does the object include watermarks?
  41. Have the watermarks been maintained?
  42. Does the object include custom character sets?
  43. Have the custom character sets been maintained?
  44. Does the object include any custom languages or language interfaces?
  45. Have the custom languages or language interfaces been maintained?
  46. Has the number of words reported by the software been maintained?
  47. Has the actual number of words in the document been maintained?
  48. Does the document have footnotes or endnotes?
  49. Have the footnotes or endnotes been maintained?
  50. Does the document have an embedded object that adds the current-date to the object?
  51. Has the embedded date been maintained? (please comment)
  52. Does the document have internal links within it?
  53. Have the internal links been maintained?
  54. Does the document include lists or bullet points?
  55. Have the lists or bullet points been maintained?
  56. Have the list or bullet point symbols been maintained?
  57. Have the tables been maintained?
  58. Has the table formatting/layout been maintained?
  59. Are there borders within the document?
  60. Have the borders been maintained?
  61. Are there citations in the document?
  62. Have the citations been maintained?
  63. Are there mail-merge settings applied in the document?
  64. Have the mail-merge settings been maintained?
  65. Does the document include comments?
  66. Have the comments been maintained?
  67. Are there formulae in the object?
  68. Have the formulae been maintained?
  69. Has the notation language of the formulae been maintained?
  70. Has the way rounding is calculated been maintained?
  71. Has the number of decimal places displayed been maintained?
  72. Does the object have any internal links?
  73. Have the internal links been maintained?
  74. Does the object have an embedded object that adds the current-date to the object?
  75. Has the embedded date been maintained? (please comment)
  76. Does the object have coloured cells?
  77. Have the cell colours been maintained?
  78. Does the object include cells with borders?
  79. Have the cell borders been maintained?
  80. Does the object include any conditional formatting?
  81. Has the conditional formatting been maintained?
  82. Has the column order been maintained?
  83. Has the row order been maintained?
  84. Have functions been maintained? E.g. standard deviation?
  85. Does the object include pivot tables?
  86. Have the pivot tables been maintained?
  87. Does the object include hidden rows or columns?
  88. Have the hidden rows or columns been maintained?
  89. Does the object include named cells?
  90. Have the cell names been maintained?
  91. Does the object include named ranges?
  92. Have the named ranges been maintained?
  93. Have cell types been maintained? E.g. number, text or date
  94. Does the object include any applied filters?
  95. Have the applied filters been maintained?
  96. Does the object include links to other data sources?
  97. Have the links to other data sources been maintained?
  98. Does the object include multiple worksheets?
  99. Have all of the worksheets been maintained?
  100. Has the embedded date been maintained? (please comment)
  101. Has the shading or colours been maintained?
  102. Has the layout been maintained?
  103. Does the graph include a title?
  104. Has the title been maintained?
  105. Does the graph include labels on axes or data points?
  106. Have the labels been maintained?
  107. Have the proportions and/or ratios of the axes been maintained?
  108. Does the graph include the ability to view the data source(s)?
  109. Has the ability to view the data source(s) been maintained?
  110. Does the presentation include animated (or other) slide transitions?
  111. Have the slide transitions been maintained?
  112. Does the presentation include audio or video?
  113. Is the audio or video still renderable?
  114. Has the interface/presentation mode of the slides been maintained, for example the click to do x or wait 2 seconds before x?
  115. Does the presentation have an embedded function that adds the current-date to the object?
  116. Has the embedded date been maintained? (please comment)
  117. Does the database have a custom front-end, interface or form(s)?
  118. Has the ability to render and interact with the custom front-end been maintained?
  119. Does the database include saved queries?
  120. Have the queries been maintained?
  121. Has the internal structure been maintained (e.g. primary keys, links between tables etc.)?
  122. Does the database include links to other data sources?
  123. Have the links to other data sources been maintained?
  124. Does the database include any custom views?
  125. Have the custom views been maintained?
  126. Does the database have an embedded function that adds the current-date to the object?
  127. Has the embedded date been maintained? (please comment)
  128. Has all useful functionality in the object been maintained?
  129. Are there any other changes to the object that have not been identified in other questions?