Creating an artificial test set using emulation

From CURATEcamp
Revision as of 14:18, 13 November 2012 by Euan Cochrane (talk | contribs) (Plan for the Hackathon day)

Main Page > CURATEcamp iPRES 2012 > CURATEcamp 24 hour worldwide file id hackathon Nov 16 2012

Files for use in testing format ID tools are most useful if they have a known source and known content. It can be hard to find test files that meet both criteria and are also free to use. A simple way to solve this problem is to create the files with the original software, using emulation or virtualisation software to run it.

Plan for the Hackathon day

  1. WHILE STATUS = AWAKE
    1. Pick a software application and a parameter to use to create files, and add the details here (link to Google spreadsheet to be added), e.g. WordStar 7 for DOS, "default save parameter (WordStar 7.0 format)", plus details of the user doing the work.
    2. Set up emulation environment and install software.
    3. Get content to include: copy text to a text file, structured data to a CSV or tab-delimited file, and images to BMP/TIFF, then add these to a virtual disk file that can be attached to the emulated environment.
    4. Boot environment and create file using selected parameter. Ensure all content & significant properties are included.
    5. Take screenshots to ensure significant properties are thoroughly documented.
    6. Transfer files into the host OS.
    7. Upload files to the GitHub repository.
    8. GOTO 2
  2. END
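
The record-keeping in steps 1, 6 and 7 could be sketched as follows. This is only an illustration: the field names, the CSV manifest and the choice of SHA-256 are assumptions, not something agreed for the hackathon. The idea is to store the software, parameter and user alongside a checksum taken once the file has reached the host OS, so that each uploaded file keeps its known source.

```python
import csv
import hashlib
from dataclasses import dataclass, asdict, fields
from pathlib import Path


@dataclass
class TestFileRecord:
    """Provenance for one generated test file (field names are illustrative)."""
    filename: str
    software: str     # e.g. "WordStar 7 for DOS"
    parameter: str    # e.g. "default save parameter (WordStar 7.0 format)"
    created_by: str   # the user doing the work
    sha256: str       # checksum taken after transfer to the host OS


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe(path: Path, software: str, parameter: str, created_by: str) -> TestFileRecord:
    """Build the record for a file that has just been copied out of the emulator."""
    return TestFileRecord(path.name, software, parameter, created_by, sha256_of(path))


def append_to_manifest(manifest: Path, record: TestFileRecord) -> None:
    """Append one record, writing the header row first if the manifest is new."""
    is_new = not manifest.exists()
    with manifest.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(TestFileRecord)])
        if is_new:
            writer.writeheader()
        writer.writerow(asdict(record))
```

A call such as `append_to_manifest(Path("manifest.csv"), describe(Path("SAMPLE.WS"), "WordStar 7 for DOS", "default save parameter (WordStar 7.0 format)", "username"))` would add one row per created file; the file names here are hypothetical.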

Software used for creating files

Parameters to use (formats to create)

Content to include (significant properties)

Ideally, every possible type of content should be included in each file, with multiple instances of each type in different configurations. This would ensure comprehensive testing options. It would also be useful to ensure that content is not repeated in the test files, so that it is easy to identify where in the file the content came from. A list of potential types of content is included in Appendix 4 of the Rendering Matters Report. These were described in question form in the report and were as follows (they could do with sorting into document, data, slide set etc.):

  1. Digital object type:
  2. Digital Object ID (if available)
  3. Computer File ID From DROID
  4. Test Application ID (repeated below)
  5. Will the object render?/Will the file open at all in the application?
  6. Are there macros or scripts in the digital object?
  7. Are there any links in the file to other files?
  8. Can the Macros or Scripts be executed?
  9. Are the links to external files still working?
  10. Are there any editing restrictions on the object?
  11. Have the restrictions been maintained?
  12. What type of rendering is being observed in this test?
  13. Does the object contain an edit history?
  14. Has the edit history been maintained?
  15. Is there metadata embedded in the file such as the author's name, date saved, amount of time spent authoring, etc?
  16. Has the embedded metadata been maintained?
  17. Are any/all fonts being fully and accurately rendered?
  18. Has the text formatting been maintained? e.g. bold, italic, underline, superscript, subscript or strike-through?
  19. Is there text formatting included in the object, e.g. bold, italics, underline, strike-through, subscript or superscript?
  20. Does the object have text of any colour other than black?
  21. Has the text colour been maintained?
  22. Does the object include highlighted text?
  23. Has the highlighted text been maintained?
  24. Have the page dimensions been maintained?
  25. Has the pagination been maintained?
  26. Has the position on screen of content been maintained?
  27. Has the position of content on the page been maintained?
  28. Has line spacing been maintained?
  29. Have the new-lines been correctly placed?
  30. Have page and section breaks been maintained?
  31. Has the orientation of objects/text been maintained?
  32. Has the justification of text been maintained?
  33. Has any extra/additional information/data been added to the object that is observable by the user?
  34. Does the object contain images?
  35. Has the image orientation and position been maintained?
  36. Has the image size been maintained?
  37. Have the colours of the image been maintained?
  38. Has the resolution of the image been maintained?
  39. Does the object include custom views?
  40. Have custom views been maintained?
  41. Does the object include custom shapes?
  42. Have custom shapes been maintained?
  43. Does the object include hidden content?
  44. Has the hidden content been maintained?
  45. Does the object include watermarks?
  46. Have the watermarks been maintained?
  47. Does the object include custom character sets?
  48. Have the custom character sets been maintained?
  49. Does the object include any custom languages or language interfaces?
  50. Have the custom languages or language interfaces been maintained?
  51. Has the number of words reported by the software been maintained?
  52. Has the actual number of words in the document been maintained?
  53. Does the document have footnotes or endnotes?
  54. Have the footnotes or endnotes been maintained?
  55. Does the document have an embedded object that adds the current-date to the object?
  56. Has the embedded date been maintained? (please comment)
  57. Does the document have internal links within it?
  58. Have the internal links been maintained?
  59. Does the document include lists or bullet points?
  60. Have the lists or bullet points been maintained?
  61. Have the list or bullet point symbols been maintained?
  62. Have the tables been maintained?
  63. Has the table formatting/layout been maintained?
  64. Are there borders within the document?
  65. Have the borders been maintained?
  66. Are there citations in the document?
  67. Have the citations been maintained?
  68. Are there mail-merge settings applied in the document?
  69. Have the mail-merge settings been maintained?
  70. Does the document include comments?
  71. Have the comments been maintained?
  72. Are there formulae in the object?
  73. Have the formulae been maintained?
  74. Has the notation language of the formulae been maintained?
  75. Has the way rounding is calculated been maintained?
  76. Has the number of decimal places displayed been maintained?
  77. Does the object have any internal links?
  78. Have the internal links been maintained?
  79. Does the object have an embedded object that adds the current-date to the object?
  80. Has the embedded date been maintained? (please comment)
  81. Does the object have coloured cells?
  82. Have the cell colours been maintained?
  83. Does the object include cells with borders?
  84. Have the cell borders been maintained?
  85. Does the object include any conditional formatting?
  86. Has the conditional formatting been maintained?
  87. Has the column order been maintained?
  88. Has the row order been maintained?
  89. Have functions been maintained? E.g. standard deviation?
  90. Does the object include pivot tables?
  91. Have the pivot tables been maintained?
  92. Does the object include hidden rows or columns?
  93. Have the hidden rows or columns been maintained?
  94. Does the object include named cells?
  95. Have the cell names been maintained?
  96. Does the object include named ranges?
  97. Have the named ranges been maintained?
  98. Have cell types been maintained? E.g. number, text or date
  99. Does the object include any applied filters?
  100. Have the applied filters been maintained?
  101. Does the object include links to other data sources?
  102. Have the links to other data sources been maintained?
  103. Does the object include multiple worksheets?
  104. Have all of the worksheets been maintained?
  105. Has the embedded date been maintained? (please comment)
  106. Have the shading or colours been maintained?
  107. Has the layout been maintained?
  108. Does the graph include a title?
  109. Has the title been maintained?
  110. Does the graph include labels on axes or data points?
  111. Have the labels been maintained?
  112. Have the proportions and/or ratios of the axes been maintained?
  113. Does the graph include the ability to view the data source(s)?
  114. Has the ability to view the data source(s) been maintained?
  115. Does the Presentation include animated (or other) slide transitions?
  116. Have the slide transitions been maintained?
  117. Does the presentation include audio or video?
  118. Is the audio or video still renderable?
  119. Has the interface/presentation mode of the slides been maintained, for example the click to do x or wait 2 seconds before x?
  120. Does the presentation have an embedded function that adds the current-date to the object?
  121. Has the embedded date been maintained? (please comment)
  122. Does the database have a custom front-end, interface or form(s)?
  123. Has the ability to render and interact with the custom front-end been maintained?
  124. Does the database include saved queries?
  125. Have the queries been maintained?
  126. Has the internal structure been maintained (e.g. primary keys, links between tables etc.)?
  127. Does the database include links to other data sources?
  128. Have the links to other data sources been maintained?
  129. Does the database include any custom views?
  130. Have the custom views been maintained?
  131. Does the database have an embedded function that adds the current-date to the object?
  132. Has the embedded date been maintained? (please comment)
  133. Has all useful functionality in the object been maintained?
  134. Are there any other changes to the object that have not been identified in other questions?
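
The suggestion above that content should not be repeated, so that it is easy to tell where in a file it came from, could be approached by giving every piece of inserted content its own unique marker string. The sketch below is a minimal illustration under stated assumptions: `make_marker` and `find_marker` are hypothetical helper names, labels are assumed to be plain ASCII, and the search only finds markers that the application stored as plain ASCII bytes. Formats that compress or re-encode text would need a different search.

```python
import uuid


def make_marker(label: str) -> str:
    """Return a short ASCII string that is very unlikely to occur anywhere else."""
    return f"{label}-{uuid.uuid4().hex[:12]}"


def find_marker(data: bytes, marker: str) -> list[int]:
    """Return every byte offset at which the marker occurs in plain ASCII form."""
    needle = marker.encode("ascii")
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return offsets
```

For example, a footnote could be typed as the output of `make_marker("footnote")` and a header as the output of `make_marker("header")`. After saving, `find_marker(saved_bytes, marker)` would report where each one ended up in the file, with `saved_bytes` being the file contents read on the host OS.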