
Found a corrupt STDF, and I'm told these might come in regularly... #30

@mgoldste1

So I'm in the validation phase of my converter, and I found one STDF file that my parser can't handle. I've spent a large portion of my week looking into this file, trying to understand why LinqToStdf can't parse it correctly when two other parsers I'm using can. One of those parsers is QuickEdit. The other was written by someone in my company, so I had a chat with him about this.

Basically what happens is, some of these STDFs are coming from questionable sources and can't be relied on fully. There is no way around this. This STDF file has an insane number of DTRs, which is normal for this product. Something happened while this file was being written, and in the middle of a DTR, 4096 zero bytes were written to disk. I haven't figured out yet whether the DTR was interrupted and its data continued after the string of zeros ended, or whether that cluster on disk was corrupt and the DTR would have ended somewhere in the string of zeros. I attempted to remove the 4096 zeros using EmEditor's binary mode, but it didn't help; either I did that wrong or the first theory is wrong. Tomorrow I'm going to try deleting most of the zeros, but leave enough that the correct number of bytes are there for the string.
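
Here's a rough sketch of how the zero-filled region could be located by offset before attempting a manual repair in a hex editor. This is not LinqToStdf code; the file path argument and the 4096-byte threshold are just assumptions for illustration.

```csharp
// Minimal sketch: scan an STDF file for long runs of 0x00 bytes so the
// corrupt region can be located by offset before attempting a manual repair.
using System;
using System.IO;

class ZeroRunFinder
{
    static void Main(string[] args)
    {
        const int threshold = 4096;          // length of the suspected zero fill
        byte[] data = File.ReadAllBytes(args[0]);

        int runStart = -1;
        for (int i = 0; i <= data.Length; i++)
        {
            bool isZero = i < data.Length && data[i] == 0;
            if (isZero && runStart < 0)
            {
                runStart = i;                // run begins
            }
            else if (!isZero && runStart >= 0)
            {
                int runLength = i - runStart;
                if (runLength >= threshold)
                    Console.WriteLine($"Zero run of {runLength} bytes at offset 0x{runStart:X}");
                runStart = -1;               // run ends
            }
        }
    }
}
```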

This is what the DTR's text field looks like when I export it to JSON:
"Text": "[103 characters redacted]\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"

I believe each \u0000 equates to a single zero byte, which makes the field 245 characters long and within the 255-character limit.
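
For reference, here's a tiny sketch of how a C*n field like the DTR's TEXT is laid out (a one-byte length prefix followed by that many characters), which is where the 255-character ceiling comes from. The byte values below are illustrative stand-ins, not taken from the actual file.

```csharp
// Minimal sketch of an STDF C*n field: a U*1 length prefix followed by
// up to 255 characters, so NUL bytes inside the text still count toward n.
using System;
using System.Text;

class CnFieldDemo
{
    // Reads a C*n starting at 'offset'; returns the string and advances the offset.
    static string ReadCn(byte[] buffer, ref int offset)
    {
        int length = buffer[offset];                          // U*1 length prefix, 0..255
        string text = Encoding.ASCII.GetString(buffer, offset + 1, length);
        offset += 1 + length;
        return text;
    }

    static void Main()
    {
        // 245-byte payload: 103 real characters followed by 142 NUL bytes,
        // mirroring the corrupted DTR described above.
        byte[] field = new byte[1 + 245];
        field[0] = 245;
        for (int i = 1; i <= 103; i++) field[i] = (byte)'A';  // stand-in for the redacted text
        int offset = 0;
        string text = ReadCn(field, ref offset);
        Console.WriteLine($"Field length: {text.Length}");    // prints 245
    }
}
```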

The guy who wrote the parser I'm replacing said that his attempts to read a record, and if it can't parse it correctly, it goes back to the end of the previous record, advances 1 byte, then tries again. If it fails again, it goes back and advances 2 bytes... It continues until it actually gets something it can make sense of. That seems like a pretty serious change for this package, but I'm not sure I see another way around this. Since the length comes first in the header, maybe it's possible to have it check whether the length is zero and, if so, assume corruption? Like, can a record actually have a length of zero? If not, then this specific issue might be easier to fix than I thought, but this would assume that when the data stops being zeros, it's actually the start of a real record and not the middle of a different record. Also, since the length is two bytes, the first time it gets a non-zero value I suppose it'd have a 50% chance of being right, because you don't know whether the first byte of the U*2 was actually a zero or not... And this would potentially only fix this specific issue. Having it do what I described at the start of this paragraph would be way better.
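
To make the resync idea concrete, here's a rough sketch of what "back up to the end of the last good record, advance a byte, and try again" could look like. This is not how LinqToStdf is structured; the plausibility test (a whitelist of known type/subtype pairs plus a length that fits in the remaining file) is my own assumption, and a single plausible-looking header can still be a false positive, so a real implementation would probably want to confirm that a chain of records parses from the candidate offset.

```csharp
// Minimal sketch of header resynchronization after a parse failure.
// Assumes the common little-endian byte order declared by the FAR's CPU_TYPE.
// Note: a few STDF record types can legitimately carry REC_LEN = 0, so a zero
// length alone isn't proof of corruption, but (type 0, subtype 0) is not a
// defined record, so a run of all-zero headers is a red flag.
using System;
using System.Collections.Generic;
using System.IO;

class ResyncSketch
{
    // A few well-known STDF V4 (type, subtype) pairs; extend as needed.
    static readonly HashSet<(byte, byte)> KnownRecords = new HashSet<(byte, byte)>
    {
        (0, 10), (0, 20),      // FAR, ATR
        (1, 10), (1, 20),      // MIR, MRR
        (1, 40), (1, 50),      // HBR, SBR
        (5, 10), (5, 20),      // PIR, PRR
        (15, 10), (15, 15),    // PTR, MPR
        (50, 10), (50, 30),    // GDR, DTR
    };

    static bool LooksLikeHeader(byte[] data, long pos)
    {
        if (pos + 4 > data.Length) return false;
        ushort recLen = (ushort)(data[pos] | (data[pos + 1] << 8));   // REC_LEN (U*2)
        byte recTyp = data[pos + 2];                                   // REC_TYP (U*1)
        byte recSub = data[pos + 3];                                   // REC_SUB (U*1)
        return KnownRecords.Contains((recTyp, recSub))
            && pos + 4 + recLen <= data.Length;                        // payload must fit in the file
    }

    // Returns the first offset at or after 'lastGoodEnd' that looks like a record header.
    static long Resync(byte[] data, long lastGoodEnd)
    {
        for (long pos = lastGoodEnd; pos + 4 <= data.Length; pos++)
            if (LooksLikeHeader(data, pos)) return pos;
        return -1;
    }

    static void Main(string[] args)
    {
        byte[] data = File.ReadAllBytes(args[0]);
        long lastGoodEnd = long.Parse(args[1]);        // offset where the last good record ended
        Console.WriteLine(Resync(data, lastGoodEnd));
    }
}
```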

A little more about this file and what I've done...

When LinqToStdf gets to this bad section, it reads in the DTR, reads in enough zeros to finish where it thinks that record should end, and then the next 988 records after that are read in as unknown records of type 0, subtype 0, and length 0. After it runs out of zeros, I believe it assumes it's starting at a record header. It sees a type/subtype it doesn't understand, reads in the number of bytes the record length field tells it to, then moves on under the assumption that the next byte is the start of the next record. The only reason I noticed was that one of these random type/subtypes it read in matched an SBR record type. It tried to parse what it thought was an SBR and failed because the C*n at the end of the record claimed to be longer than the number of bytes left in that record.
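
This behavior falls straight out of the header layout: every four zero bytes look like a complete empty header, so a long zero run is consumed as back-to-back "type 0, subtype 0, length 0" records. A tiny demonstration is below; the 3952-byte figure is my own back-calculation (4096 minus the bytes the truncated DTR claimed for itself), chosen because it works out to the 988 unknown records reported above.

```csharp
// Minimal sketch of the STDF header layout (REC_LEN U*2, REC_TYP U*1, REC_SUB U*1)
// showing why a run of zero bytes parses as consecutive empty unknown records.
using System;

class ZeroHeaderDemo
{
    static void Main()
    {
        byte[] zeros = new byte[3952];                  // leftover zero fill after the DTR "ends"
        int pos = 0, count = 0;

        while (pos + 4 <= zeros.Length)
        {
            ushort recLen = (ushort)(zeros[pos] | (zeros[pos + 1] << 8));  // 0
            byte recTyp = zeros[pos + 2];                                   // 0
            byte recSub = zeros[pos + 3];                                   // 0
            pos += 4 + recLen;                          // header plus (empty) payload
            count++;
        }
        Console.WriteLine(count);                       // prints 988
    }
}
```
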
This is what the order of the records looks like in the file. It's formatted as [record short name]([record type]-[record subtype])([quantity of this type of record in a row]):
DTR(50-30)(6528)
UNKNOWN(0-0)(988)
?UNKNOWN(85-90)(1)
?UNKNOWN(110-52)(1)
?UNKNOWN(107-53)(1)
?UNKNOWN(103-68)(1)
?UNKNOWN(57-89)(1)
?UNKNOWN(122-120)(1)
?UNKNOWN(56-89)(1)
?UNKNOWN(120-109)(1)
?UNKNOWN(76-84)(1)
?UNKNOWN(107-114)(1)
?UNKNOWN(75-70)(1)
?UNKNOWN(74-111)(1)
?UNKNOWN(61-32)(1)
?UNKNOWN(73-72)(1)
?UNKNOWN(104-116)(1)
?UNKNOWN(53-122)(1)
?UNKNOWN(86-104)(1)
?UNKNOWN(101-109)(1)
?UNKNOWN(108-97)(1)
?UNKNOWN(116-73)(1)
SBR(1-50)(1)
?UNKNOWN(83-83)(1)
?UNKNOWN(97-99)(1)
?UNKNOWN(79-73)(1)
?UNKNOWN(102-120)(1)
?UNKNOWN(106-85)(1)
?UNKNOWN(47-118)(1)

When I disable SBR parsing, SBRs are treated as unknown records, so it doesn't error out at that record. What happens next is it seems to randomly land on the start of a DTR, parses it correctly, and continues on fine. With SBR parsing disabled, it seemed like my converter successfully converted the file, but it actually left out a die because a PIR/PRR pair was missed. When I inspect the STDF file with EmEditor's binary mode, these unknown records contain DTRs, PTRs, a PIR, and a PRR. When I put this file into QuickEdit, write it out again, and then feed that file into my converter, it finds the correct number of die.
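
For catching this kind of silent die loss, a quick sanity check is to count PIR and PRR headers in the file (and compare against the QuickEdit-rewritten copy). Here's a rough sketch of that check, independent of LinqToStdf; it assumes the common little-endian byte order declared by the FAR's CPU_TYPE and that the record stream itself is walkable, so it's only useful on the repaired output, not the corrupt original.

```csharp
// Minimal sketch: walk the record headers and count PIRs (type 5, sub 10) and
// PRRs (type 5, sub 20) so a dropped die shows up as a count mismatch.
using System;
using System.IO;

class DieCountCheck
{
    static void Main(string[] args)
    {
        byte[] data = File.ReadAllBytes(args[0]);
        long pos = 0;
        int pirCount = 0, prrCount = 0;

        while (pos + 4 <= data.Length)
        {
            ushort recLen = (ushort)(data[pos] | (data[pos + 1] << 8)); // REC_LEN
            byte recTyp = data[pos + 2];
            byte recSub = data[pos + 3];

            if (recTyp == 5 && recSub == 10) pirCount++;   // PIR
            if (recTyp == 5 && recSub == 20) prrCount++;   // PRR

            pos += 4 + recLen;                              // next record header
        }

        Console.WriteLine($"PIR: {pirCount}, PRR: {prrCount}");
    }
}
```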

So I have a manual workaround for the issue, but I can't say how often the STDFs for this product are going to be like this. I had access to the most recent 40 or so STDFs and they were all fine, so that's promising, but if it happens once, especially during a validation phase, you can put money on it happening again.
